oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Exllama v2 crashes when starting to load in the third gpu #6305

Open ciprianveg opened 1 month ago

ciprianveg commented 1 month ago

Describe the bug

Exllama v2 crashes when it starts to load into the third GPU. No matter whether the order is 3090, 3090, A4000 or A4000, 3090, 3090, when I try to load the Mistral Large 2407 EXL2 3.0bpw quant, it crashes after filling the first two GPUs, right when it should start loading the rest of the model into the third GPU. There is 64 GB of VRAM in total, so it should fit. Loading the model as GGUF works, but I prefer EXL2. I am using it via oobabooga, updated to the latest version.

Is there an existing issue for this?

Reproduction

Try to load the Mistral Large 2407 EXL2 3.0bpw quant across 3 GPUs, if possible 3090, 3090, A4000 16 GB.
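
To help isolate whether the failure is in the webui or in the backend, here is a minimal sketch of loading the same quant directly with the exllamav2 Python API. The model path and the per-GPU split values are placeholders, and this assumes an exllamav2 build where `ExLlamaV2.load()` accepts a `gpu_split` list of per-device VRAM budgets in GB.

```python
# Minimal repro sketch outside the webui (path and split values are placeholders).
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Mistral-Large-2407-exl2-3.0bpw"
config.prepare()
config.max_seq_len = 4096  # keep the cache small for the test

model = ExLlamaV2(config)
# One VRAM budget (in GB) per visible GPU; leave headroom below each card's capacity.
model.load(gpu_split=[22, 22, 14])
print("Model loaded")
```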

Screenshot

No response

Logs

No logs; oobabooga crashes with "Press any key to continue" in the console window (pause).

System Info

5900X AM4, 64 GB RAM, 2x3090 + 1xA4000. Windows 11, latest NVIDIA drivers
ciprianveg commented 1 month ago

Same with exllamaV2_HF

ciprianveg commented 1 month ago

With Llama 3.1 turboderp 70B 6.9bpw I get: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.00 GiB. GPU 2 has a total capacity of 24.00 GiB of which 22.54 GiB is free. Of the allocated memory 155.58 MiB is allocated by PyTorch, and 30.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
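
The traceback itself points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of where that variable has to be set is below; it only mitigates allocator fragmentation, so it may not help with a hard crash, and the launch-script placement is an assumption rather than something confirmed in this thread.

```python
# The variable must be set before the first CUDA allocation, e.g. at the very
# top of the script that launches the webui, or in the shell beforehand:
#   set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   (Windows cmd)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the variable is in place
```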

Kaszebe commented 1 month ago

+1

I have 120 GB of VRAM and I'm getting a CUDA OOM when trying to load a ~50 GB EXL2 quant.

ciprianveg commented 1 month ago

Installing older NVIDIA drivers for Windows (545.92) fixed my issue.
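
For anyone comparing driver versions, a small sanity check (my own sketch, not something from the driver release notes) to confirm PyTorch still sees all three GPUs with the expected free VRAM before attempting the load:

```python
import torch

# Print each visible GPU with its free/total VRAM as PyTorch sees it.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes
    name = torch.cuda.get_device_properties(i).name
    print(f"GPU {i}: {name}, {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB free")
```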