turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

CPU offloading #225

Closed by bibidentuhanoi 7 months ago

bibidentuhanoi commented 7 months ago

Is there any way that I can offload the model to the CPU? My machine has limited resources (32 GB of RAM and 6 GB of VRAM). Thank you.

turboderp commented 7 months ago

There is no built-in way, no. If your NVIDIA driver supports system RAM swapping, that's a way to run larger models than you could otherwise fit in VRAM, but it's going to be horrendously slow. You may be better off running GGUF models in llama.cpp, offloading what you can onto the GPU but doing CPU inference for the rest.
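For reference, partial offload in llama.cpp means keeping some transformer layers on the GPU and running the rest on the CPU. A minimal sketch with the llama-cpp-python bindings is below; the model filename and layer count are placeholders, not anything from this thread, and `n_gpu_layers` would need tuning to fit in 6 GB of VRAM.

```python
# Sketch of partial GPU offload with llama-cpp-python
# (pip install llama-cpp-python, built with CUDA enabled).
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",  # placeholder: any GGUF quant on disk
    n_gpu_layers=20,                 # layers kept on the GPU; the rest run on CPU
    n_ctx=2048,                      # context length
)

out = llm("Q: What does CPU offloading mean? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The equivalent from the llama.cpp CLI is the `--n-gpu-layers` flag; either way, more layers on the GPU means faster inference until VRAM runs out.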

bibidentuhanoi commented 7 months ago

Thank you. I have tried using llama.cpp, but the output from that model is pretty unusable (even the q8_0) compared to the EXL2 model. Anyway, what an amazing project you have here.