turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

CUDA out of memory, but that doesn't seem to be true #268

Closed. Sebastianv650 closed this issue 3 weeks ago.

Sebastianv650 commented 6 months ago

Hello, I can load exl2 models in general, using an RTX 3050. But now, with a slightly bigger model (LoneStriker_SauerkrautLM-UNA-SOLAR-Instruct-4.0bpw-h6-exl2) at 5.4 GB file size, I get a strange OOM error even though only 5.4 GB of the 8 GB are in use:

CUDA out of memory. Tried to allocate 38.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 1.46 GiB is free. Of the allocated memory 5.35 GiB is allocated by PyTorch, and 84.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.

As only about 85 MiB is reserved but unallocated, I don't think the suggestion in the error message is applicable here. Any ideas what I could try?
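
For reference, the setting the error message points at is PyTorch's caching-allocator option max_split_size_mb, passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before CUDA is initialized (in the WebUI's case, before launching it). A minimal sketch, with 128 MiB as an arbitrary example value:

```python
import os

# PyTorch parses PYTORCH_CUDA_ALLOC_CONF when the CUDA caching allocator is
# first used, so set it before any CUDA allocation happens.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

x = torch.zeros(1, device="cuda")  # the first allocation picks up the setting
```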

I'm running Windows 10 and oobabooga WebUI.

turboderp commented 6 months ago

Loading a model takes somewhat more VRAM than the size of the weights on disk. There's also a cache (context) and a number of buffers required for inference.

Sadly, Torch's out-of-memory messages are kind of useless. It'll usually say it failed to allocate a small amount of VRAM while a lot more is "free", but in reality that memory isn't free. One way to check would be opening Task Manager to see the actual VRAM usage as the model is loading.
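
The same numbers can also be read from inside the process with PyTorch's standard memory queries; a minimal sketch, assuming device 0:

```python
import torch

free, total = torch.cuda.mem_get_info(0)    # driver-level view, in bytes
allocated = torch.cuda.memory_allocated(0)  # bytes held by live tensors
reserved = torch.cuda.memory_reserved(0)    # bytes held by the caching allocator

print(f"total     {total / 2**30:6.2f} GiB")
print(f"free      {free / 2**30:6.2f} GiB")
print(f"allocated {allocated / 2**30:6.2f} GiB")
print(f"reserved  {reserved / 2**30:6.2f} GiB")
```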

SOLAR seems like it should just about fit on a 3050, but it's a tight fit, so you could try troubleshooting by reducing the context size a little.
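
For anyone loading the model through exllamav2's Python API rather than the WebUI, the context size is controlled by max_seq_len on the config. A rough sketch, loosely following the repo's example scripts (the model path is a placeholder, and class names such as ExLlamaV2Cache_8bit may differ between versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit

config = ExLlamaV2Config()
config.model_dir = "/path/to/SauerkrautLM-UNA-SOLAR-Instruct-4.0bpw-h6-exl2"  # placeholder path
config.prepare()

config.max_seq_len = 2048  # smaller context -> smaller KV cache

model = ExLlamaV2(config)
model.load()               # load onto the single GPU

cache = ExLlamaV2Cache_8bit(model)  # 8-bit cache, roughly half the VRAM of FP16
```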

Sebastianv650 commented 6 months ago

Task Manager confirms the 5.4 GB reported by the error message. I'm aware of the space needed for cache and context, but my guess was that it might work, since I'm able to use, for example, a 3bpw exl2 model of Tiefighter 13B with 4k context and the 8-bit cache enabled. In that case, about 7.5 GB of VRAM is used.

With the LoneStriker_SauerkrautLM-UNA-SOLAR-Instruct-4.0bpw-h6-exl2 model, it fails even at 1k context with the 8-bit cache.

If you think it's a real out-of-VRAM situation, then that's OK and I'll have to stick to smaller models.

Edit: The model with the problems is 5.4 GB, which is the exact same size as bartowski_OpenHermes-2.5-Mistral-7B-exl2_6.0, which I also use without memory issues. So again, I can't prove it, but I guess there might be a bug somewhere...

DocShotgun commented 6 months ago

> Edit: The model with the problems is 5.4 GB, which is the exact same size as bartowski_OpenHermes-2.5-Mistral-7B-exl2_6.0, which I also use without memory issues. So again, I can't prove it, but I guess there might be a bug somewhere...

IIUC, the cache size wouldn't be the same between the two models, despite the quantized weights being the same size, due to the SOLAR model having a greater number of layers.
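
A rough back-of-the-envelope comparison of the two KV caches. This is a sketch with assumed architecture values (SOLAR-10.7B as a depth-upscaled Mistral with 48 layers versus Mistral-7B's 32, both with 8 KV heads of dimension 128); check each model's config.json for the real numbers:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each of shape [seq_len, num_kv_heads, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed architecture values -- check each model's config.json.
models = {
    "Mistral-7B":  dict(num_layers=32, num_kv_heads=8, head_dim=128),
    "SOLAR-10.7B": dict(num_layers=48, num_kv_heads=8, head_dim=128),
}

for name, cfg in models.items():
    fp16 = kv_cache_bytes(seq_len=4096, **cfg)
    int8 = kv_cache_bytes(seq_len=4096, bytes_per_elem=1, **cfg)
    print(f"{name}: {fp16 / 2**20:.0f} MiB FP16 / {int8 / 2**20:.0f} MiB 8-bit at 4k context")
```

Under those assumptions the SOLAR cache is roughly 1.5x the Mistral cache at the same context length, and the extra couple of hundred MiB, on top of larger per-layer activation buffers, can be enough to push an already tight 8 GB fit over the edge.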