vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

AWQ (Support Mixtral): Implement new `modules_to_not_convert` parameter in config #2243

Closed: casper-hansen closed this issue 7 months ago

casper-hansen commented 10 months ago

AutoAWQ now supports Mixtral on the main branch. Quantizing Mixtral requires leaving the gate (the MoE router) unquantized. To load such a checkpoint correctly, vLLM needs to skip the modules listed in `modules_to_not_convert` instead of loading them as quantized linear layers.

You can load this 4-bit model in ~24 GB of VRAM, but you will need some headroom on top of that for the KV cache and inference; I used a 48 GB GPU for my testing. The relevant `quantization_config` from the checkpoint's `config.json`:

  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": [
      "gate"
    ],
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
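On the vLLM side, the implication is that the weight loader consults `modules_to_not_convert` when deciding how to build each linear module. A minimal sketch of that logic (the helper names and the `AWQLinear` placeholder below are illustrative, not vLLM's actual API):

```python
# Hypothetical sketch of honoring `modules_to_not_convert`; the names below
# (is_layer_skipped, AWQLinear) are illustrative, not vLLM's actual API.
from typing import Optional

import torch.nn as nn


class AWQLinear(nn.Module):
    """Placeholder standing in for a real AWQ 4-bit linear layer."""

    def __init__(self, in_features: int, out_features: int,
                 bits: int, group_size: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.bits = bits
        self.group_size = group_size


def is_layer_skipped(module_name: str,
                     modules_to_not_convert: Optional[list]) -> bool:
    """True if this module should stay a plain (unquantized) linear layer."""
    if not modules_to_not_convert:
        return False
    return any(skipped in module_name for skipped in modules_to_not_convert)


def build_linear(module_name: str, in_features: int, out_features: int,
                 quant_config: dict) -> nn.Module:
    # e.g. module_name == "model.layers.0.block_sparse_moe.gate"
    if is_layer_skipped(module_name,
                        quant_config.get("modules_to_not_convert")):
        # The MoE router ("gate") keeps full-precision weights.
        return nn.Linear(in_features, out_features, bias=False)
    return AWQLinear(in_features, out_features,
                     bits=quant_config["bits"],
                     group_size=quant_config["group_size"])
```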

Model reference:

https://huggingface.co/casperhansen/mixtral-instruct-awq

ssnl commented 10 months ago

FWIW, the gate is currently not quantized in vLLM's Mixtral implementation: https://github.com/vllm-project/vllm/blob/4aaafdd289f57a82513a7742155e4f1b796c8bdc/vllm/model_executor/models/mixtral.py#L131

But vLLM probably should not hardcode it.
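Concretely, the difference is roughly the following (illustrative Python; the function names are stand-ins and `nn.Linear` stands in for vLLM's replicated linear layer): today the skip is baked into the model definition, whereas the proposal is to drive it from the checkpoint's `quantization_config`.

```python
# Illustrative sketch only; not vLLM's actual internals, and the AWQ path
# is elided.
import torch.nn as nn


def build_gate_hardcoded(hidden_size: int, num_experts: int) -> nn.Module:
    # Current behaviour: the MoE gate is always built unquantized,
    # no matter what the checkpoint's quantization config says.
    return nn.Linear(hidden_size, num_experts, bias=False)


def build_gate_from_config(hidden_size: int, num_experts: int,
                           quant_config: dict) -> nn.Module:
    # Proposed behaviour: consult modules_to_not_convert instead of
    # hardcoding the module name in the model definition.
    if "gate" in quant_config.get("modules_to_not_convert", []):
        return nn.Linear(hidden_size, num_experts, bias=False)
    raise NotImplementedError("an AWQ-quantized linear would be built here")
```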

hmellor commented 7 months ago

Mixtral AWQ works now
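For anyone landing here later, a minimal usage sketch with the Python API, assuming a recent vLLM build and the checkpoint referenced above (adjust sampling and memory settings to your setup):

```python
from vllm import LLM, SamplingParams

# Load the AWQ-quantized Mixtral checkpoint referenced above.
llm = LLM(model="casperhansen/mixtral-instruct-awq", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] What is AWQ quantization? [/INST]"], params)
print(outputs[0].outputs[0].text)
```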