Closed — bibidentuhanoi closed this 7 months ago
There is no built-in way, no. If your NVIDIA driver supports swapping into system RAM, that's one way to run models larger than your VRAM, but it will be horrendously slow. You may be better off running GGUF models in llama.cpp, offloading as many layers as you can onto the GPU and doing CPU inference for the rest.
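For reference, partial offload in llama.cpp is controlled by the `-ngl` / `--n-gpu-layers` flag. A minimal sketch, assuming a CUDA-enabled build; the model path and layer count are placeholders you would tune to fit your 6 GB of VRAM:

```shell
# Hypothetical invocation: adjust the model path and layer count for your setup.
# -ngl / --n-gpu-layers places that many transformer layers in VRAM;
# the remaining layers run on the CPU out of system RAM.
./llama-cli -m models/model-q8_0.gguf -ngl 20 -c 4096 -p "Hello"
```

If a run fails with an out-of-memory error, lower the `-ngl` value until the offloaded layers fit. (On older llama.cpp builds the binary is named `main` instead of `llama-cli`.)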
Thank you. I have tried using llama.cpp, but the output from that model is pretty unusable (even at q8_0) compared to the Exl2 model. Anyway, what an amazing project you have here.
Is there any way to offload part of the model to the CPU? My machine has limited resources (32 GB of RAM and 6 GB of VRAM). Thank you.