Open · ciprianveg opened this issue 1 month ago
Same with the ExLlamav2_HF loader.
With turboderp's Llama 3.1 70B 6.9bpw I get:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.00 GiB. GPU 2 has a total capacity of 24.00 GiB of which 22.54 GiB is free. Of the allocated memory 155.58 MiB is allocated by PyTorch, and 30.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
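For what it's worth, the workaround the traceback itself suggests is easy to rule in or out first. A minimal sketch, assuming you control the Python entry point (with the webui, exporting PYTORCH_CUDA_ALLOC_CONF in the shell before launching server.py achieves the same thing):

```python
# Sketch: enable expandable segments in the CUDA caching allocator to
# reduce fragmentation, as the OOM message suggests. The variable must
# be read before the first CUDA allocation, so set it before importing
# torch to be safe.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the env var is in place

x = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments
```

This only helps when the failure is fragmentation rather than a genuinely full card; given that the message reports 22.54 GiB free on GPU 2, it seems worth checking.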
+1
I have 120 GB of VRAM and am getting CUDA OOM when trying to load a ~50 GB EXL2 quant.
Installing older NVIDIA drivers for Windows (545.92) fixed my issue.
Describe the bug
ExLlamaV2 crashes when it starts loading into the third GPU. Regardless of GPU order (3090, 3090, A4000 or A4000, 3090, 3090), when I try to load Mistral Large 2407 exl2 3.0bpw it crashes after filling the first two GPUs, at the point where it should start loading the rest of the model into the third one. There is 64 GB of VRAM in total, so it should fit. Loading the model as GGUF works, but I prefer exl2. I am using it via oobabooga, updated to the latest version.
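To take the webui out of the equation, the same split can be attempted with the ExLlamaV2 Python API directly. A minimal sketch, assuming a recent exllamav2 release whose config constructor accepts the model directory; the path and the per-GPU budgets below are placeholders, not values from the report:

```python
# Manual three-way split with the ExLlamaV2 API, bypassing the webui.
# If this also dies when loading spills onto the third device, the bug
# is in exllamav2 rather than in oobabooga's loader glue.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/models/Mistral-Large-2407-exl2-3.0bpw")  # placeholder path

model = ExLlamaV2(config)

# Per-GPU budgets in GB: two 3090s (24 GB each) plus one A4000 (16 GB),
# with headroom left on each card for activations and the cache.
model.load(gpu_split=[22.0, 22.0, 14.0])

cache = ExLlamaV2Cache(model)
```

If the standalone load succeeds, the problem is more likely in how the webui computes or passes the split.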
Is there an existing issue for this?
Reproduction
Try to load Mistral Large 2407 exl2 3.0bpw across three GPUs, ideally 3090 + 3090 + A4000 16 GB.
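For reference, a launch command of roughly the shape below should reproduce the setup; the flag names are text-generation-webui's documented options, but the model name and split values are placeholders:

```
python server.py --model Mistral-Large-2407-exl2-3.0bpw --loader ExLlamav2_HF --gpu-split 22,22,14
```

The crash should occur once loading reaches the third entry of the split.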
Screenshot
No response
Logs
System Info