vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Does vLLM support the 4-bit quantized version of the Mixtral-8x7B-Instruct-v0.1 model downloaded from Hugging Face? #3128

Closed: leockl closed this issue 5 months ago

leockl commented 7 months ago

Hey guys,

Does vLLM support the 4-bit quantized version of the Mixtral-8x7B-Instruct-v0.1 model downloaded from Hugging Face here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1?

According to the Hugging Face link above, we can switch over to the 4-bit version using this line of code:

```python
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
```

This is urgent so I would really appreciate any help on this.

simon-mo commented 7 months ago

vLLM supports GPTQ and AWQ quantization. You can use https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
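
For reference, a minimal sketch of how that GPTQ checkpoint could be loaded through vLLM's offline `LLM` API; the `tensor_parallel_size` and sampling values here are illustrative, not prescriptive, and should be adjusted to your hardware:

```python
# Minimal sketch: serving the GPTQ-quantized Mixtral checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",        # the checkpoint is GPTQ-quantized
    dtype="half",               # GPTQ kernels run in fp16
    tensor_parallel_size=2,     # split across 2 GPUs; adjust to your setup
)

outputs = llm.generate(
    ["Explain the Mixture-of-Experts architecture in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```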

leockl commented 7 months ago

@simon-mo ok, many thanks for this! Sorry, dumb question: do you roughly know the difference between the AWQ quantized model provided by TheBloke and the 4-bit model you get by setting load_in_4bit, as in the following line of code from the Hugging Face repo for Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1):

```python
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
```

simon-mo commented 7 months ago

I believe load_in_4bit uses the wonderful bitsandbytes library (https://huggingface.co/docs/bitsandbytes/main/en/index). However, it does not perform calibration, so it might have worse accuracy than the calibrated GPTQ/AWQ variants.
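
To make the distinction concrete, here is a sketch using Transformers' quantization configs (the model id and calibration dataset are just examples): bitsandbytes quantizes the weights at load time with no calibration data, while GPTQ runs a calibration pass over a dataset.

```python
# Illustrative comparison only; model id and dataset choice are examples.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GPTQConfig,
)

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bitsandbytes: weights are quantized on the fly at load time, no calibration pass.
bnb_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# GPTQ: quantization minimizes layer-wise error against a calibration dataset,
# which is why it tends to preserve accuracy better at the same 4-bit footprint.
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer),
    device_map="auto",
)
```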

ann-lab52 commented 7 months ago

Hi @simon-mo, I looked at the vLLM documentation and also #392 but didn't find any confirmation of bnb support. Am I missing something, or does vLLM just not support it yet?

simon-mo commented 7 months ago

We do not support bitsandbytes. We recommend the GPTQ equivalent, which has the same memory footprint and higher accuracy.

leockl commented 7 months ago

@simon-mo ok great got it, thanks heaps for your help!

KeremTurgutlu commented 6 months ago

> We do not support bitsandbytes. We recommend the GPTQ equivalent, which has the same memory footprint and higher accuracy.

Would it be easy to implement bitsandbytes support? There are newer techniques that allow quantization-aware training, such as QLoRA, which might perform even better than data-dependent, calibration-based quantization techniques, at least for fine-tuned models.
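
For context, a rough sketch of the QLoRA-style setup being referred to (a bitsandbytes NF4 base model plus LoRA adapters via PEFT); the hyperparameters and target modules are illustrative only, and the resulting bnb-quantized base is exactly what vLLM cannot currently load:

```python
# Rough QLoRA-style sketch: 4-bit NF4 base model (bitsandbytes) + LoRA adapters (PEFT).
# Hyperparameters and target_modules are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Here the 4-bit quantization happens at load time with no calibration data, and accuracy is recovered by fine-tuning the adapters, which is the property the comment above argues could rival calibration-based quantization for fine-tuned models.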

hmellor commented 5 months ago

bitsandbytes (bnb) support has been requested in #4033

hmellor commented 4 months ago

Yes, GPTQ is supported.