randaller / llama-chat

Chat with Meta's LLaMA models at home made easy
GNU General Public License v3.0

Do you have any plans to support GPTQ-4bit model? #19

Closed · forestsource closed this issue 2 weeks ago

forestsource commented 1 year ago

It seems that GPTQ 4-bit models are already supported in this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa

Danielv123 commented 1 year ago

I'd also like to know how to do this. The primary bottleneck seems to be how fast the layers can be fed to the GPU: my copy load is at 80% while GPU load is at 10%. Is there a way to improve this? I assume that if the layers were quantized down to a quarter of their size, it would be almost 4x faster.
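A rough sketch of the arithmetic behind that "almost 4x" estimate. The layer dimensions and PCIe bandwidth below are assumed, illustrative values (roughly LLaMA-7B-sized), not measurements from this project:

```python
# Back-of-envelope estimate of per-layer host-to-GPU transfer time
# when layers are streamed to the GPU instead of kept resident.
# All figures below are illustrative assumptions.

hidden = 4096          # model dimension (assumed)
ffn = 11008            # feed-forward dimension (assumed)

# attention (q, k, v, o) + feed-forward (gate, up, down) weight matrices
params_per_layer = 4 * hidden * hidden + 3 * hidden * ffn

pcie_bandwidth = 12e9  # assumed effective host-to-device bandwidth, bytes/s

for name, bytes_per_param in [("fp16", 2.0), ("int4", 0.5)]:
    layer_bytes = params_per_layer * bytes_per_param
    copy_ms = layer_bytes / pcie_bandwidth * 1e3
    print(f"{name}: {layer_bytes / 1e6:.0f} MB per layer, ~{copy_ms:.1f} ms to copy")
```

With these numbers, fp16 works out to roughly 400 MB per layer versus about 100 MB at 4 bits, so if the copy really is the bottleneck, the per-layer transfer time drops by about 4x.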

breadbrowser commented 1 year ago

> It seems that GPTQ 4-bit models are already supported in this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa

That repo is meant for the bare weights.