turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

dbrx doesn't respect gpu_split, OOMs on the first GPU no matter what #390

Closed tdrussell closed 6 months ago

tdrussell commented 6 months ago

Using the latest commit d3184ec, I was able to make my own 4bpw quant of dbrx-instruct. I am running into problems trying to load the model in text-generation-webui (using that same commit of exllamav2). I'm on a machine with four 4090s, and the loader seems to ignore the gpu_split parameter entirely. Whether I set it to auto, 22,22,22,22, 10,22,22,22, or even 1,1,1,1, the result is the same: it completely fills up the first GPU's VRAM and OOMs, and nothing is ever loaded onto the other three GPUs. I tested other exl2 models, and they all adhere to the gpu_split I configure.
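For reference, a minimal sketch of how I'd expect a manual split to be passed through exllamav2's own API (the model path and the 22 GB-per-GPU figures are placeholders, and exact signatures may differ between commits):

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/dbrx-instruct-4.0bpw-exl2"  # placeholder path to the exl2 quant
config.prepare()

model = ExLlamaV2(config)
# gpu_split is a list of per-GPU VRAM budgets in GB; this is what the webui's
# "22,22,22,22" setting should ultimately translate into
model.load(gpu_split = [22, 22, 22, 22])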

RandomInternetPreson commented 6 months ago

To add to the original observation, I can confirm the same behavior. I can use the chat.py file in the examples folder of the exllamav2 repo to successfully load a self-quantized version of dbrx across multiple GPUs.
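For reference, the working invocation through the example script looked roughly like this (the model path is a placeholder, and the flags are as I understand them from examples/chat.py at that commit):

python examples/chat.py -m /models/dbrx-instruct-4.0bpw-exl2 -gs 22,22,22,22

With -gs / --gpu_split given like that, the weights spread across multiple GPUs as expected.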

However, when textgen is updated with the latest version of exllamav2, I experience the same error as the OP: my first GPU fills up and none of the other GPUs load any part of the model:

18:39:24-360904 INFO Loading "dbrx-instruc4bit"
18:39:27-779103 ERROR Failed to load the model.
Traceback (most recent call last):
  File "/home/myself/Desktop/OobMar27/text-generation-webui/modules/ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/modules/models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/modules/models.py", line 373, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/modules/exllamav2.py", line 60, in from_pretrained
    model.load(split)
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 302, in load
    for item in f: x = item
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/model.py", line 325, in load_gen
    module.load()
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/moe_mlp.py", line 103, in load
    self.w3[e].load()
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/linear.py", line 90, in load
    if w is None: w = self.load_weight()
                      ^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/module.py", line 106, in load_weight
    qtensors = self.load_multi(key, ["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm", "bias"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/module.py", line 86, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myself/Desktop/OobMar27/text-generation-webui/installer_files/env/lib/python3.11/site-packages/exllamav2/fasttensors.py", line 204, in get_tensor
    tensor = f.get_tensor(key)
             ^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 18.56 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 22.52 GiB is allocated by PyTorch, and 665.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

RandomInternetPreson commented 6 months ago

A workaround I just found is to use the "ExLlamav2_HF" loader instead of the "ExLlamav2" loader. For some reason the HF loader splits the model correctly, while the non-HF loader tries to load everything onto a single GPU.
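For anyone else hitting this before updating, the workaround launch looked roughly like this (flag names as I understand them; double-check against your text-generation-webui version):

python server.py --model dbrx-instruc4bit --loader ExLlamav2_HF --gpu-split 22,22,22,22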

tdrussell commented 6 months ago

I was on a 13-day-old commit of ooba's webui. After pulling and updating all requirements (but keeping exllamav2 at head), it now works. Both the exllamav2_hf and exllamav2 loaders work for me and split the model across GPUs according to gpu_split. I don't know what changed; there have been no commits to either exllama loader file since the commit I was using. Anyway, I'll go ahead and close this.

TLDR: make sure to update text-generation-webui and all requirements.
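Roughly, from the webui directory, that update amounts to something like this (adjust for your setup; the one-click installer has its own update scripts):

git pull
pip install -r requirements.txt --upgrade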