fwiw, currently `gate` is not quantized in Mixtral in vLLM (https://github.com/vllm-project/vllm/blob/4aaafdd289f57a82513a7742155e4f1b796c8bdc/vllm/model_executor/models/mixtral.py#L131), but probably vLLM should not hardcode it.
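If you want to serve the Mixtral AWQ checkpoint referenced later in this thread with vLLM, it looks roughly like this. This is a hedged sketch, not a tested recipe: it assumes vLLM's AWQ path accepts this checkpoint as-is.

```python
from vllm import LLM, SamplingParams

# vLLM keeps the router ("gate") unquantized internally, so an AWQ checkpoint
# that skipped the gate should line up with this code path (assumption).
llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    dtype="half",
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] What does AWQ quantize? [/INST]"], params)
print(outputs[0].outputs[0].text)
```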
Mixtral AWQ works now
AutoAWQ now supports Mixtral on the main branch. It requires that we do not quantize the `gate` in the model. To prevent the gate from being quantized and loaded as a quantized linear layer, you have to skip loading the modules listed in `modules_to_not_convert` as quantized linear layers.

You can load this 4-bit model in ~24 GB VRAM, but you probably need a bit more for actual KV caching and inference. I used a 48 GB VRAM GPU for my testing.
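For reference, quantizing Mixtral with the gate excluded looks roughly like the sketch below. This is a minimal sketch assuming the AutoAWQ main-branch API; the exact plumbing of `modules_to_not_convert` (in the quant config vs. as a `quantize()` argument) may differ between versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quant_path = "mixtral-instruct-awq"

# Keep the MoE router ("gate") in full precision.
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert,
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize everything except the excluded modules, then save the result.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=modules_to_not_convert,
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```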
Model reference:
https://huggingface.co/casperhansen/mixtral-instruct-awq
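Loading the quantized checkpoint for inference is then roughly as follows; a minimal sketch that assumes a GPU with enough VRAM for the weights plus KV cache, as noted above.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/mixtral-instruct-awq"

model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Mixtral-Instruct prompt format.
prompt = "[INST] Explain AWQ quantization in one paragraph. [/INST]"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

out = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```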