vLLM supports GPTQ and AWQ quantization. You can use https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
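For example, a minimal sketch of loading that GPTQ checkpoint through vLLM's offline API (the `quantization` argument is usually auto-detected from the checkpoint's config, so it is shown here only for clarity):

```python
from vllm import LLM, SamplingParams

# Load the GPTQ-quantized Mixtral checkpoint; vLLM reads the quantize_config
# shipped with the model, so quantization="gptq" is mostly explicit documentation.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",
)

outputs = llm.generate(
    ["[INST] Summarize GPTQ quantization in one sentence. [/INST]"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```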
@simon-mo ok many thanks for this! Sorry, dumb question: do you roughly know what the difference is between the AWQ quant model provided by TheBloke vs. the 4-bit model that you get with the following line of code from the Hugging Face repo for Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1):
```python
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
```
I believe `load_in_4bit` uses the wonderful bitsandbytes library (https://huggingface.co/docs/bitsandbytes/main/en/index). However, it does not perform calibration, so it might have worse accuracy than the calibrated GPTQ/AWQ variants.
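Roughly speaking, `load_in_4bit=True` is shorthand for an explicit `BitsAndBytesConfig` along these lines (a sketch; the exact defaults depend on the transformers version), and no calibration dataset is involved at any point:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit bitsandbytes 4-bit configuration: weights are stored in NF4 and
# dequantized on the fly; there is no calibration step as in GPTQ/AWQ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```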
Hi @simon-mo, I looked at the vLLM documentation and also #392 but didn't find any confirmation of bnb support. Am I missing something, or does vLLM just not support it yet?
We do not support bitsandbytes. We recommend the GPTQ equivalent, which has the same memory footprint and higher accuracy.
@simon-mo ok great got it, thanks heaps for your help!
> We do not support bitsandbytes. We recommend the GPTQ equivalent, which has the same memory footprint and higher accuracy.
Would it be easy to implement bitsandbytes support? There are new techniques that allow quantization-aware training, such as QLoRA, which might perform even better than data-dependent calibration quantization techniques, at least for fine-tuned models.
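For context, the QLoRA setup referred to above looks roughly like this (a sketch using the peft library; the target modules and hyperparameters are purely illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit bitsandbytes base model, kept frozen during training.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA adapters are trained on top of the quantized weights.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
```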
bnb has been requested in #4033
Yes, GPTQ is supported
Hey guys,
Does vLLM support the 4-bit quantized version of the Mixtral-8x7B-Instruct-v0.1 model downloaded from Hugging Face (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)?
According to the Hugging Face link above, we can switch over to the 4-bit version using this line of code:
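```python
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
```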
This is urgent, so I would really appreciate any help on this.