Hey @Syst3m1cAn0maly, we don't support quantization in vLLM for non-Mixtral MoEs yet. We are currently undergoing a refactor to support Qwen2 and DeepSeek-V2: https://github.com/vllm-project/vllm/pull/6088
Thank you for the efforts. Looking forward to FP8 support for DSv2❤️
Working on this today
This should do it for you:
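For reference, a minimal sketch of the usual AutoFP8 flow (the output directory and calibration prompt are placeholders, and the routing-gate tweak discussed below is not yet applied here):

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
quantized_model_dir = "DeepSeek-Coder-V2-Lite-Instruct-FP8"  # placeholder output dir

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)
# Single placeholder prompt; use a real calibration set for static activation scales.
examples = tokenizer(["def hello_world():"], return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```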
Thanks a lot. I will try as soon as possible. Do I need to change the settings to quantize this model properly with AutoFP8, or should it work as-is? (I saw there was a specific setting for Mixtral models regarding MoE gates.)
You need to skip the routing gate:

```python
from auto_fp8 import BaseQuantizeConfig

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    # skip the lm head and expert gate
    ignore_patterns=["re:.*lm_head", "re:.*gate.weight"],
)
```
The other thing I'm not sure about is the following layers:
- `self_attn.kv_a_proj_with_mqa`
- `self_attn.kv_b_proj`

I'm working on seeing how sensitive they are now.
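If those projections do turn out to be sensitive, one option (my assumption, not something confirmed in this thread) would be to add them to the ignore list so they stay unquantized:

```python
from auto_fp8 import BaseQuantizeConfig

# Hypothetical: also keep the MLA kv projections in higher precision
# (assumption; only worth it if FP8 measurably hurts accuracy).
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=[
        "re:.*lm_head",
        "re:.*gate.weight",
        "re:.*kv_a_proj_with_mqa",
        "re:.*kv_b_proj",
    ],
)
```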
Thanks, I will try with these settings.
FYI: the config above is good, but it needed one more tweak on the vLLM side.
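With that tweak in place, loading the quantized checkpoint should be the standard vLLM flow; a sketch (the model path is a placeholder):

```python
from vllm import LLM, SamplingParams

# Placeholder path to the AutoFP8-quantized checkpoint (assumption).
llm = LLM(model="DeepSeek-Coder-V2-Lite-Instruct-FP8", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["def quicksort(arr):"], params)[0].outputs[0].text)
```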
@robertgshaw2-neuralmagic thanks a lot for the work
I tested today and it now works as expected, thanks!
Hi!
I quantized DeepSeek-Coder-V2-Lite-Instruct to FP8 using AutoFP8, but when I try to run it with vLLM I get the following error:

```
RuntimeError: "cat_cuda" not implemented for 'Float8_e4m3fn'
```
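The message means a `torch.cat` is hitting FP8 tensors on CUDA, where no kernel exists for that dtype; a minimal repro of the same error (my illustration, not from the original report):

```python
import torch

# torch.cat has no CUDA kernel for float8_e4m3fn, so this raises
# RuntimeError: "cat_cuda" not implemented for 'Float8_e4m3fn'
a = torch.zeros(2, 4, dtype=torch.float8_e4m3fn, device="cuda")
b = torch.zeros(2, 4, dtype=torch.float8_e4m3fn, device="cuda")
torch.cat([a, b])
```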
I ran the quantization using this script:
and I got the following output:
What can I do to quantize this kind of model correctly?