project-baize / baize-chatbot

Let ChatGPT teach your own chatbot in hours with a single GPU!
https://arxiv.org/abs/2304.01196
GNU General Public License v3.0
3.15k stars · 275 forks

gptq 4-bit quantized version #17

Open · regstuff opened 1 year ago

regstuff commented 1 year ago

Hi, do you have any plans to release a GPTQ 4-bit quantized version of your models? That would cut VRAM usage and speed up inference considerably, without much loss in capability. A lot of other LLaMA/Alpaca models are doing this. I'd do it myself, but I don't have the kind of RAM needed for the conversion. Thanks for this great model. Please keep going!
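
For context, the payoff being asked for looks roughly like the following: loading a 4-bit GPTQ checkpoint through AutoGPTQ instead of the full fp16 weights. This is a minimal sketch only, not something this repo ships; the checkpoint directory name is a placeholder and assumes someone has already produced a GPTQ export.

```python
# Sketch: loading a hypothetical 4-bit GPTQ export of Baize with AutoGPTQ.
# The directory name is a placeholder; no such checkpoint has been released here.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "baize-gptq-4bit-128g"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)
# The 4-bit weights are dequantized on the fly, so the whole model
# fits in a fraction of the fp16 VRAM footprint.
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```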

alxfoster commented 1 year ago

+1, I'd love to see this too. If helpful, I have access to a sufficiently capable machine (Ubuntu, 28 cores, 168 GB RAM, 132 GB swap, 2x RTX 3090 with NVLink, 48 GB VRAM) and would be willing to provide the compute if anyone can draft a detailed guide or help with setup to quantize the 65B LLaMA in 4-bit / 128g.
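
In case it helps whoever picks this up: a quantization run along the requested lines (4 bits, group size 128) can be sketched with AutoGPTQ. This is an illustrative outline under stated assumptions, not an official Baize script; the model path, output directory, and calibration text are placeholders, and a real run would use a few hundred calibration samples (e.g. drawn from C4) rather than a single sentence.

```python
# Sketch: GPTQ 4-bit / 128g quantization of a merged LLaMA checkpoint via AutoGPTQ.
# All paths are placeholders; adjust to the actual merged Baize weights.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "path/to/merged-baize-llama"  # placeholder: merged fp16 weights
quantized_dir = "baize-gptq-4bit-128g"         # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)

# GPTQ calibrates against sample activations; a real run needs a few
# hundred samples, one toy example is shown here for brevity.
examples = [tokenizer("Baize is an open-source chat model trained with LoRA.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # the "128g" in the request above
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)  # runs GPTQ layer by layer over the calibration set
model.save_quantized(quantized_dir, use_safetensors=True)
```

For 65B specifically, the layer-by-layer nature of GPTQ means the main constraint is system RAM for holding the fp16 checkpoint, which is why the machine described above looks suitable.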

lolxdmainkaisemaanlu commented 1 year ago

This would be really helpful!

davidliudev commented 1 year ago

+1. It would be great if this could be done.