oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Cuda Out of Memory...? #1274

Closed NotBillyJoel357 closed 1 year ago

NotBillyJoel357 commented 1 year ago

```
Traceback (most recent call last):
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\text-generation-webui\server.py", line 85, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 168, in load_quantized
    model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\big_modeling.py", line 370, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 478, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 251, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "C:\Users\andri\Desktop\TavernAI-main\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 147, in set_module_tensor_to_device
    new_value = old_value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 4.00 GiB total capacity; 3.56 GiB already allocated; 0 bytes free; 3.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Please help.
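For reference, the allocator hint at the end of the error can be applied by setting `PYTORCH_CUDA_ALLOC_CONF` before torch is imported; a minimal sketch (the 128 MB split size is just an example value, and this only helps with fragmentation, not with a card that is simply too small for the model):

```python
# Sketch: set the allocator hint from the error message *before* torch is
# imported / the server starts. 128 MB is only an example value.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # must be imported after the environment variable is set

print(torch.cuda.is_available())
```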

corndoghh commented 1 year ago

You need a better GPU with more VRAM

NotBillyJoel357 commented 1 year ago

Any way to do it without a better GPU?

kuso-ge commented 1 year ago

Tick the CPU option; it's going to be painfully slow, though.
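For context, a rough sketch of what CPU-only loading looks like with plain Transformers (the model name below is a small placeholder, not what the webui would load for you):

```python
# Loading and generating entirely on the CPU with Transformers; slow, but it
# avoids CUDA memory altogether. The model name is only a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # stays on the CPU by default

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```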

practical-dreamer commented 1 year ago

With only 4 GB of VRAM @bloodsign is probably right... you'll be OOM with almost anything. Regular offloading to the CPU is usually pretty slow, with one exception: llama.cpp.

I'd recommend looking at this page to see if it fits your intended use: https://github.com/oobabooga/text-generation-webui/wiki/llama.cpp-models
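To give a picture of what that looks like outside the webui, here is a minimal llama-cpp-python sketch of partial GPU offloading (the model path and layer count below are placeholders, not recommendations):

```python
# Minimal llama-cpp-python sketch: keep most of the model on the CPU and
# offload only some layers to a small GPU. Path and counts are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.q4_0.bin",  # placeholder quantized model file
    n_ctx=2048,        # context length
    n_gpu_layers=20,   # offload a few layers to the GPU, keep the rest on CPU
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```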

Shadowgar commented 1 year ago

I'm running into similar issues. I'm using an RTX 3060 with 6 GB of VRAM and getting the same error. Is there no way around this?

Sourdface commented 1 year ago

I started getting this today after reinstalling. (Accidentally deleted my WSL install. Whoops.) It wasn't a problem before.

I'm running on Windows 11 inside WSL with an RTX 3080 (12GB). It doesn't happen for the first few generations, then suddenly I only get a single character/token out and it's out of memory.

YeiSimon commented 1 year ago

After prompting, you need to kill the process to release the CUDA memory.

(screenshot omitted)

Link: https://stackoverflow.com/questions/50193538/how-to-kill-process-on-gpus-with-pid-in-nvidia-smi-using-keyword
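If killing the whole process is awkward, the memory can sometimes be released from inside the same Python session instead; a small self-contained sketch (a plain tensor stands in for the loaded model):

```python
# Sketch: free GPU memory in-process by dropping the reference and clearing
# PyTorch's cached allocator blocks. A tensor stands in for a loaded model.
import gc
import torch

x = torch.zeros(1024, 1024, device="cuda")  # stand-in for model weights
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")

del x                      # drop the last reference
gc.collect()               # let Python reclaim the object
torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated after cleanup")
```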

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.