vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

AWQ (Support Mixtral): Implement new `modules_to_not_convert` parameter in config #2243

Closed: casper-hansen closed this issue 7 months ago

casper-hansen commented 10 months ago

AutoAWQ now supports Mixtral on the main branch. Quantizing Mixtral requires leaving the gate (the MoE router) unquantized. To load such a checkpoint correctly, vLLM needs to skip the modules listed in `modules_to_not_convert` instead of loading them as quantized linear layers.

You can load this 4-bit model in ~24 GB of VRAM, but you will need some headroom on top of that for the KV cache and inference; I used a 48 GB GPU for my testing. The relevant `quantization_config` from the checkpoint's `config.json`:

  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": [
      "gate"
    ],
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
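On the vLLM side, the implication is that the weight loader consults `modules_to_not_convert` when deciding how to build each linear module. A minimal sketch of that logic (the helper names and the `AWQLinear` placeholder below are illustrative, not vLLM's actual API):

```python
# Hypothetical sketch of honoring `modules_to_not_convert`; the names below
# (is_layer_skipped, AWQLinear) are illustrative, not vLLM's actual API.
from typing import Optional

import torch.nn as nn


class AWQLinear(nn.Module):
    """Placeholder standing in for a real AWQ 4-bit linear layer."""

    def __init__(self, in_features: int, out_features: int,
                 bits: int, group_size: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.bits = bits
        self.group_size = group_size


def is_layer_skipped(module_name: str,
                     modules_to_not_convert: Optional[list]) -> bool:
    """True if this module should stay a plain (unquantized) linear layer."""
    if not modules_to_not_convert:
        return False
    return any(skipped in module_name for skipped in modules_to_not_convert)


def build_linear(module_name: str, in_features: int, out_features: int,
                 quant_config: dict) -> nn.Module:
    # e.g. module_name == "model.layers.0.block_sparse_moe.gate"
    if is_layer_skipped(module_name,
                        quant_config.get("modules_to_not_convert")):
        # The MoE router ("gate") keeps full-precision weights.
        return nn.Linear(in_features, out_features, bias=False)
    return AWQLinear(in_features, out_features,
                     bits=quant_config["bits"],
                     group_size=quant_config["group_size"])
```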

Model reference:

https://huggingface.co/casperhansen/mixtral-instruct-awq

ssnl commented 10 months ago

FWIW, the gate is currently not quantized in vLLM's Mixtral implementation: https://github.com/vllm-project/vllm/blob/4aaafdd289f57a82513a7742155e4f1b796c8bdc/vllm/model_executor/models/mixtral.py#L131

But vLLM probably should not hardcode it.
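Concretely, the difference is roughly the following (illustrative Python; the function names are stand-ins and `nn.Linear` stands in for vLLM's replicated linear layer): today the skip is baked into the model definition, whereas the proposal is to drive it from the checkpoint's `quantization_config`.

```python
# Illustrative sketch only; not vLLM's actual internals, and the AWQ path
# is elided.
import torch.nn as nn


def build_gate_hardcoded(hidden_size: int, num_experts: int) -> nn.Module:
    # Current behaviour: the MoE gate is always built unquantized,
    # no matter what the checkpoint's quantization config says.
    return nn.Linear(hidden_size, num_experts, bias=False)


def build_gate_from_config(hidden_size: int, num_experts: int,
                           quant_config: dict) -> nn.Module:
    # Proposed behaviour: consult modules_to_not_convert instead of
    # hardcoding the module name in the model definition.
    if "gate" in quant_config.get("modules_to_not_convert", []):
        return nn.Linear(hidden_size, num_experts, bias=False)
    raise NotImplementedError("an AWQ-quantized linear would be built here")
```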

hmellor commented 7 months ago

Mixtral AWQ works now
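For anyone landing here later, a minimal usage sketch with the Python API, assuming a recent vLLM build and the checkpoint referenced above (adjust sampling and memory settings to your setup):

```python
from vllm import LLM, SamplingParams

# Load the AWQ-quantized Mixtral checkpoint referenced above.
llm = LLM(model="casperhansen/mixtral-instruct-awq", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] What is AWQ quantization? [/INST]"], params)
print(outputs[0].outputs[0].text)
```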