iamthemulti opened this issue 1 month ago
@iamthemulti Quantizing the Aria model is challenging because it uses grouped-gemm for efficient bfloat16 inference and training rather than standard nn.Linear layers. The grouped-gemm implementation can be found in the Aria repository: https://github.com/rhymes-ai/Aria/blob/719ff4e52b727443cba3793b0e27fe64e0244fe1/aria/model/moe_lm.py#L444-L482

I'm currently working on a custom solution to address this quantization issue.
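For anyone unfamiliar with why grouped-gemm blocks the usual tooling, here is a rough illustrative sketch (not the actual Aria code): all expert weights live in one stacked bfloat16 parameter consumed by a fused grouped-GEMM kernel, so there is no nn.Linear module for quantization libraries to hook into. Class and argument names below are made up for illustration.

```python
import torch
import torch.nn as nn

class GroupedGemmExperts(nn.Module):
    """Illustrative MoE expert block where all experts share one stacked weight tensor."""

    def __init__(self, num_experts: int, hidden_size: int, ffn_size: int):
        super().__init__()
        # One (num_experts, hidden_size, ffn_size) parameter instead of
        # num_experts separate nn.Linear modules -- invisible to quantizers
        # that only look for nn.Linear layers.
        self.weight = nn.Parameter(
            torch.empty(num_experts, hidden_size, ffn_size, dtype=torch.bfloat16)
        )

    def forward(self, tokens: torch.Tensor, tokens_per_expert: torch.Tensor) -> torch.Tensor:
        # Naive per-expert loop standing in for the fused grouped-gemm kernel.
        # Assumes `tokens` are already sorted by expert and `tokens_per_expert`
        # holds the per-expert token counts.
        outputs, start = [], 0
        for e, count in enumerate(tokens_per_expert.tolist()):
            outputs.append(tokens[start:start + count] @ self.weight[e])
            start += count
        return torch.cat(outputs, dim=0)
```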
@iamthemulti I've uploaded a fork of the Aria model that replaces the grouped gemm with a sequential MLP, in which each expert is implemented as a torch.nn.Linear layer executed in sequence. This adjustment simplifies quantization with current open-source libraries that are optimized for nn.Linear layers.
If you want to quantize an Aria model, please use rhymes-ai/Aria-sequential_mlp.
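A minimal sketch of the idea behind the sequential-MLP fork (not its exact code): each expert becomes a pair of plain nn.Linear layers and the experts run one after another, so standard per-layer quantization tooling can see and replace every weight. The projection names and activation here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequentialMoEMLP(nn.Module):
    """Illustrative MoE block with one nn.Linear pair per expert, run sequentially."""

    def __init__(self, num_experts: int, hidden_size: int, ffn_size: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "up_proj": nn.Linear(hidden_size, ffn_size, bias=False),
                "down_proj": nn.Linear(ffn_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        # Run experts one at a time; each expert only processes the tokens
        # routed to it (expert_ids holds the chosen expert index per token).
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                # Activation choice is illustrative, not taken from the fork.
                out[mask] = expert["down_proj"](torch.relu(expert["up_proj"](tokens[mask])))
        return out
```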
I am also trying to use some open-source tools to quantize the Aria model, but I'm encountering some issues on the H100. Currently, I don't have access to other GPUs for quantization.
Any updates on quants would be highly valuable @aria-hacker! Please keep us posted about your progress
I got a BitsAndBytes NF4 quant working based on Aria-sequential_mlp here; it requires less than 16 GB of VRAM and runs on an RTX 3090.
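For reference, a sketch of loading the sequential-MLP checkpoint with a BitsAndBytes NF4 config through Transformers; the model id and loading options are assumptions based on this thread, not the exact recipe behind the uploaded quant.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",  # sequential-MLP fork mentioned above (assumption)
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```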
I've uploaded an int8 weight-only model that has been quantized using torchao. It's also compatible with grouped-gemm. Feel free to try it out if you're interested!
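A hedged sketch of the generic torchao int8 weight-only flow, applied here to the nn.Linear fork as an assumption; the uploaded grouped-gemm-compatible checkpoint was presumably produced with a different recipe.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria-sequential_mlp",  # assumption: quantizing the sequential-MLP fork
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Replace the weights of every nn.Linear with int8 weight-only quantized tensors.
quantize_(model, int8_weight_only())
```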
Anyone else getting [ERROR|vllm_server.py:212:3614300] 2024-11-04 11:21:12,223 >> KeyError: 'language_model.layers.27.mlp.experts.experts.61.down_proj.weight' while loading the MLP model via vLLM?
We have an HQQ 4-bit version working well with just 15 GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py
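For orientation, a brief sketch of the generic 4-bit HQQ path through Transformers' HqqConfig; the linked script is the authoritative recipe, and the model id, group size, and device settings below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# 4-bit HQQ quantization settings (values chosen for illustration).
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",  # assumption: see the linked example for the exact setup
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
```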
I'm currently having issues attempting to quantize, save, and then load the model using HF Transformers.
Is there any known working method for quantizing Aria (preferably to 4-bit)?