oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Full use of dual GPU #6028

Open Skit5 opened 4 months ago

Skit5 commented 4 months ago

Hi guys! Love your work but I bought another GPU recently for the release of Llama 3 and hit a wall.

Description

Like a lot of small-wallet devs, I have a dual RTX 3090 setup. I was expecting to use it to load Llama-3-70B-Instruct in q4 but, using ExLlama, I get "RuntimeError: Insufficient VRAM for model and cache".

Using HF Transformers, I'm limited to a tiny context length (<2k). The reason is that GPU0's memory is at 93% while GPU1's is only at 67%. From an Answer.AI article I learnt about FSDP and thought its sharding method might balance the load, but it doesn't seem to be supported by text-generation-webui.
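
(For anyone who wants to reproduce the imbalance: a quick, illustrative way to check per-GPU usage from the same Python process once the model is loaded is something like the snippet below; nvidia-smi gives roughly the same picture.)

```python
import torch

# Rough per-GPU usage check; memory_allocated() only counts tensors owned by
# this process, so treat the numbers as approximate.
for i in range(torch.cuda.device_count()):
    used = torch.cuda.memory_allocated(i) / 2**30
    total = torch.cuda.get_device_properties(i).total_memory / 2**30
    print(f"GPU{i}: {used:.1f} / {total:.1f} GiB ({used / total:.0%})")
```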

Therefore I'm a bit lost on this; is it a known limitation or is there something I am missing that could improve the context length and performance?

Kind regards

Additional Context

```
01:08:16-129674 INFO Loading "Meta-Llama-3-70B-Instruct"
01:08:16-149009 INFO TRANSFORMERS_PARAMS=
{ 'low_cpu_mem_usage': True,
  'torch_dtype': torch.float16,
  'device_map': 'auto',
  'max_memory': {0: '23600MiB', 1: '24200MiB', 'cpu': '6800MiB'},
  'quantization_config': BitsAndBytesConfig {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": true,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  }
}

Loading checkpoint shards: 100%|████████████████| 30/30 [03:35<00:00, 7.18s/it]
01:11:55-407335 INFO Loading the tokenizer with use_fast=False.
01:11:55-775868 INFO Loaded "Meta-Llama-3-70B-Instruct" in 219.65 seconds.
01:11:55-777241 INFO LOADER: "Transformers"
01:11:55-778368 INFO TRUNCATION LENGTH: 8192
01:11:55-779213 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
```

Ph0rk0z commented 4 months ago

Experiment with max memory and set it lower on the first GPU, until the loader fills more of the other GPU first. It will come back around once it reaches your 2nd GPU's limit.

Skit5 commented 4 months ago

> Experiment with max memory and set it lower on the first GPU, until the loader fills more of the other GPU first. It will come back around once it reaches your 2nd GPU's limit.

Thank you for your response, but I don't get it: if I allocate less memory, won't it simply use less? What do you mean exactly?

nicoboss commented 4 months ago

It will use more memory than the limit you set. How much more depends on settings like the context length. I use the following memory limits for WizardLM-2 8x22B exl2 3.0bpw to fill every last bit of GPU memory - note how the first GPU has a lower limit than the others:

Context length: 8000
GPU 1: RTX 4090: 21 GiB limit
GPU 2: RTX 4090: 23 GiB limit
GPU 3: RTX 3080: 9 GiB limit

Skit5 commented 4 months ago

I tried, even after upgrading to the latest driver (535->550), but it still loads ~67% on GPU1 and then starts loading into GPU0 up to ~93%. I can't seem to fix it whatever I try. Any advice is welcome.

Ph0rk0z commented 4 months ago

Keep trying lower limits until you get less on the first GPU. The cap can be as low as 15 GB even.
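
In terms of the Transformers call from the log above, that means lowering the cap for device 0 only (a sketch with illustrative numbers, not values from this thread):

```python
# Same from_pretrained call as before; only the caps change. Capping GPU 0 well
# below its physical 24 GB makes device_map="auto" push more layers onto GPU 1
# and leaves headroom on GPU 0 for activations and the cache.
max_memory = {0: "15GiB", 1: "23GiB", "cpu": "6800MiB"}  # illustrative, tune per setup
```

If I remember right, these are the values the webui's gpu-memory settings for the Transformers loader feed into.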

joo0ne commented 2 months ago

I also have 2 GPUs and also have problems with loading, perhaps for the same reason: my second card is underloaded. I found the tensor_split parameter, but it is described in different ways in different places, and the descriptions obviously contradict each other.

I really can't understand how it works. It looks like the note on the wiki is more accurate (i.e. the values should sum to 100%), but it doesn't take the memory for the context into account.
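
If I read the llama-cpp-python API correctly (and given the contradictory descriptions, I may not), the values are relative weights that llama.cpp normalizes, so 60,40 and 0.6,0.4 mean the same split. A minimal sketch outside the webui, with a hypothetical file name:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,            # offload all layers to the GPUs
    tensor_split=[0.45, 0.55],  # relative share per GPU; slightly less on GPU 0
    n_ctx=8192,
)
```

Giving GPU 0 a slightly smaller share seems to leave room for the extra buffers it holds, but that's my reading, not something I've confirmed.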

Skit5 commented 2 months ago

Glad to see I'm not the only one with this complaint, although it's really a Transformers issue. With llama.cpp (using a GGUF quant) I can load the model fully, but that's limiting because I can't use the raw model or train a LoRA with that approach.

Skit5 commented 1 month ago

Ticket opened on HF Transformers, as it's probably an issue on their side: https://github.com/huggingface/transformers/issues/32412