rhymes-ai / Aria

Codebase for Aria - an Open Multimodal Native MoE
Apache License 2.0

How to quantize the model? #32

Open iamthemulti opened 1 month ago

iamthemulti commented 1 month ago

I'm currently having issues quantizing, saving, and then reloading the model with HF Transformers.

Is there any known working method for quantizing Aria (preferably to 4bit)?

aria-hacker commented 1 month ago

@iamthemulti Quantizing the Aria model is challenging because it uses grouped GEMM, rather than standard nn.Linear layers, for efficient inference and training in bfloat16. The grouped-GEMM implementation is here: https://github.com/rhymes-ai/Aria/blob/719ff4e52b727443cba3793b0e27fe64e0244fe1/aria/model/moe_lm.py#L444-L482

I'm currently working on a custom solution to address this quantization issue.
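
For illustration, here is a minimal sketch (not the actual Aria code; the shapes and module names are made up) of why a fused grouped-GEMM expert weight is invisible to quantizers that work by replacing nn.Linear modules:

```python
import torch
import torch.nn as nn

# Illustrative sketch only, not the actual Aria implementation: a grouped-GEMM
# MoE keeps all expert weights in one fused 3D parameter, so libraries that
# quantize by swapping nn.Linear modules never see the expert weights.
class FusedExperts(nn.Module):
    def __init__(self, num_experts: int, hidden: int, ffn: int):
        super().__init__()
        # one parameter of shape (num_experts, hidden, ffn) instead of
        # num_experts separate nn.Linear layers
        self.weight = nn.Parameter(
            torch.randn(num_experts, hidden, ffn, dtype=torch.bfloat16)
        )

    def forward(self, x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        # naive per-token gather; a real grouped-GEMM kernel batches this
        return torch.einsum("bh,bhf->bf", x, self.weight[expert_ids])

moe = FusedExperts(num_experts=8, hidden=16, ffn=32)
print([m for m in moe.modules() if isinstance(m, nn.Linear)])  # [] -> nothing for an nn.Linear-based quantizer to replace
```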

aria-hacker commented 1 month ago

@iamthemulti I've uploaded a fork of the Aria model that replaces the grouped GEMM with a sequential MLP, in which each expert is implemented with torch.nn.Linear layers and the experts are executed in sequence. This adjustment simplifies quantization with current open-source libraries, which are optimized for nn.Linear layers.

If you want to quantize an Aria model, please use rhymes-ai/Aria-sequential_mlp.

I am also trying to use some open-source tools to quantize the Aria model, but I'm encountering some issues on the H100. Currently, I don't have access to other GPUs for quantization.
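
For anyone curious what the sequential-MLP variant amounts to, here is a rough sketch (assumed structure, not the fork's actual code): each expert built from plain nn.Linear layers and run in a loop, which is what makes the experts visible to nn.Linear-based quantizers.

```python
import torch
import torch.nn as nn

# Rough sketch of the sequential-MLP idea (assumed structure, not the fork's code):
# every expert is built from ordinary nn.Linear layers and run one after another,
# so quantization libraries that swap nn.Linear modules can handle each expert.
class SequentialExperts(nn.Module):
    def __init__(self, num_experts: int, hidden: int, ffn: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden, ffn, bias=False),
                nn.GELU(),
                nn.Linear(ffn, hidden, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        # route each token to its assigned expert, one expert at a time
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

moe = SequentialExperts(num_experts=8, hidden=16, ffn=32)
print(sum(isinstance(m, nn.Linear) for m in moe.modules()))  # 16 nn.Linear layers to quantize
```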

DenisSergeevitch commented 1 month ago

Any updates on quants would be highly valuable, @aria-hacker! Please keep us posted on your progress.

leon-seidel commented 3 weeks ago

I got a BitsAndBytes NF4 quant working based on Aria-sequential_mlp here; it requires less than 16 GB of VRAM and runs on an RTX 3090.
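
For reference, loading the sequential-MLP fork with on-the-fly NF4 quantization through Transformers looks roughly like this (a sketch; the linked quant may skip different modules or handle the vision tower differently):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch of an on-the-fly NF4 load of the sequential-MLP fork; exact kwargs
# (skipped modules, device_map) may need adjusting for your setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Aria ships custom modeling code
    device_map="auto",
)
```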

aria-hacker commented 3 weeks ago

I've uploaded an int8 weight-only model that has been quantized using torchao. It's also compatible with grouped-gemm. Feel free to try it out if you're interested!
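
A rough sketch of an int8 weight-only recipe with torchao, applied here to the sequential-MLP checkpoint; the maintainers' actual recipe and the uploaded repo may differ.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import int8_weight_only, quantize_

# Sketch of int8 weight-only quantization with torchao; the uploaded checkpoint
# was produced by the maintainers and their exact recipe may differ.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)
quantize_(model, int8_weight_only())
```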

ntoxeg commented 2 weeks ago

Is anyone else getting [ERROR|vllm_server.py:212:3614300] 2024-11-04 11:21:12,223 >> KeyError: 'language_model.layers.27.mlp.experts.experts.61.down_proj.weight' while loading the sequential-MLP model via vLLM?
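
One way to debug this is to dump the expert parameter names from the downloaded safetensors shards and compare them with the names vLLM's weight loader expects; a diagnostic sketch (the local path below is an assumption):

```python
# Diagnostic sketch only: list expert weight names in the checkpoint shards to
# compare against what vLLM's weight loader is looking for.
# The local directory is an assumption; point it at your downloaded snapshot.
import glob
from safetensors import safe_open

for shard in sorted(glob.glob("Aria-sequential_mlp/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            if ".experts." in name:
                print(name)
```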

mobicham commented 1 week ago

We have an HQQ 4-bit version working well with just 15GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
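
For completeness, a 4-bit HQQ load through the Transformers integration looks roughly like this (a sketch only; the linked example may use the hqq library's native API, different settings, or a different model variant):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# Sketch of 4-bit HQQ quantization via the transformers integration; the linked
# hqq example may use the library's native API and different settings.
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
```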