turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Memory Management and BSOD #337

Closed Annamae-beep closed 6 months ago

Annamae-beep commented 7 months ago

I recently updated Ooba from exllamav2 0.0.11 to the current version, 0.0.13. Since doing so I’m unable to load models across two GPUs without getting a BSOD. The same issue also occurs in TabbyAPI. Everything works fine if I offload a smaller model onto a single GPU (on either of my GPUs). To stop the BSOD and get an error message instead, I changed the Nvidia driver setting to 'Prefer No Sysmem Fallback'. I am now getting the standard OOM error that Ooba throws up:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU 1 has a total capacty of 16.00 GiB of which 10.34 GiB is free. Of the allocated memory 4.46 GiB is allocated by PyTorch, and 43.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Yet the GPU has more than enough available memory. Everything still works fine if I load a GGUF model with any llama.cpp backend.
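(For reference, the max_split_size_mb suggestion in that error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch is imported. A minimal sketch, assuming you control the Python entry point; the 512 figure is an arbitrary example value, not a recommendation from this thread:

# Sketch: apply the allocator hint from the OOM message. The variable should
# be set before torch initializes CUDA; 512 MiB is only an example value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the variable is set so the allocator picks it up
print(torch.cuda.is_available())  # sanity check that CUDA still initializes

When launching Ooba or TabbyAPI from a shell instead, the same variable can be exported in the environment before starting the server.)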

Someone with a similar problem has suggested I downgrade to exllamav2 0.0.11, but I’m not sure how to do that, so I will have to wait for a fix.

System info: Windows 10 build 19045.4046, 2 x Nvidia RTX 4060 Ti, 64 GB RAM, i7-6700 CPU.

Annamae-beep commented 7 months ago

This is the full error message:

File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\modules\ui_model_menu.py", line 220, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\modules\models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\modules\models.py", line 380, in ExLlamav2_HF_loader
    return Exllamav2HF.from_pretrained(model_name)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\modules\exllamav2_hf.py", line 170, in from_pretrained
    return Exllamav2HF(config)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\modules\exllamav2_hf.py", line 44, in __init__
    self.ex_model.load(split)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\model.py", line 248, in load
    for item in f: return item
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\model.py", line 266, in load_gen
    module.load()
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\mlp.py", line 77, in load
    self.down_proj.load()
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 45, in load
    if w is None: w = self.load_weight()
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\module.py", line 96, in load_weight
    qtensors = self.load_multi(["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm"], override_key = override_key)
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\module.py", line 77, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
File "C:\AI\Oobabooga\text-generation-webui-main\text-generation-webui-main\installer_files\env\Lib\site-packages\exllamav2\fasttensors.py", line 118, in get_tensor
    return f.get_tensor(key)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 1 has a total capacty of 16.00 GiB of which 7.14 GiB is free. Of the allocated memory 7.64 GiB is allocated by PyTorch, and 68.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

turboderp commented 7 months ago

You can install any of the previous versions from the releases page, e.g.:

pip uninstall exllamav2
pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.11/exllamav2-0.0.11+cu121-cp311-cp311-win_amd64.whl

BSOD shouldn't happen in any case, so I suspect there's something real fishy going on with the NVIDIA driver. The error message isn't making a lot of sense either. Have you tried different split settings (manual/auto)? Also, since the error seems to be triggered by safetensors, which model is doing this?
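(For anyone following along, here is a minimal sketch of what manual vs. automatic split looks like with the exllamav2 Python API; the model path and the per-GPU gigabyte figures are placeholders. In Ooba the equivalent is the gpu-split field of the ExLlamav2 loader, e.g. "14,14".

# Minimal sketch of manual vs. automatic GPU split in exllamav2.
# The model directory and the GB-per-GPU figures are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "C:/models/my-exl2-model"   # placeholder path
config.prepare()

model = ExLlamaV2(config)

# Manual split: reserve roughly this many GB of weights on each GPU.
model.load(gpu_split = [14, 14])

# Auto split (recent versions): load lazily and let the cache allocation
# decide how the layers fill the GPUs instead:
# cache = ExLlamaV2Cache(model, lazy = True)
# model.load_autosplit(cache)

Trying both modes can help narrow down whether the problem follows a particular device or split boundary.)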

Annamae-beep commented 7 months ago

Thank you for the pip command, it’s useful to have, and it did load exllamav2 0.0.11, but unfortunately I still had the issue. I ran DDU in safe mode, twice to be sure, before installing an older Nvidia driver. I also stress-tested the RAM, uninstalled recent Windows 10 updates, messed around with the paging file, ran a virus scan, did fresh installs of both Tabby and Ooba, and much more, short of a fresh Windows install, which I didn’t want to do. Still the issue wouldn’t go away.

Then this morning I booted up my computer, fired up TabbyAPI, and exllamav2 is now working just fine (I’ll try Ooba later, but I suspect that will also be back to normal). The problem may well have been caused by a corrupt Nvidia driver; to be honest I really don’t know. Anyway, it’s working and I’m very grateful for your advice and all the work you do for the community, I appreciate it. Thanks again.