Closed: forestsource closed this issue 2 weeks ago
I'd also like to know how to do this. The primary bottleneck seems to be how fast the layers can be fed to the GPU: my copy load sits at 80% while GPU load is at only 10%. Is there a way to improve this? I assume that if the layers were quantized down to a quarter of the size, transfers would be almost 4x faster.
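The "almost 4x" intuition checks out as back-of-envelope arithmetic when the copy link is the bottleneck. A minimal sketch, assuming a 7B-parameter model and an illustrative ~12 GB/s effective PCIe bandwidth (both numbers are assumptions, not measurements):

```python
# Back-of-envelope estimate of host-to-GPU transfer time for model weights.
# All figures below are illustrative assumptions, not benchmarks.

def transfer_time_s(n_params: float, bits_per_param: int, bandwidth_gb_s: float) -> float:
    """Seconds to copy n_params weights at the given precision over a link."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / (bandwidth_gb_s * 1e9)

PCIE_EFFECTIVE_GB_S = 12.0  # assumed effective PCIe 3.0 x16 throughput
N_PARAMS = 7e9              # e.g. a 7B-parameter model

fp16 = transfer_time_s(N_PARAMS, 16, PCIE_EFFECTIVE_GB_S)
int4 = transfer_time_s(N_PARAMS, 4, PCIE_EFFECTIVE_GB_S)
print(f"fp16: {fp16:.2f}s  int4: {int4:.2f}s  speedup: {fp16 / int4:.1f}x")
```

The speedup ratio is exactly 16/4 = 4x regardless of the assumed bandwidth; in practice dequantization overhead on the GPU eats into it slightly, which is why "almost 4x" is the right expectation.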
It seems that GPTQ-4bit model is already supported in this project. https://github.com/qwopqwop200/GPTQ-for-LLaMa
That repository is meant for the bare weights.