vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

When will quantized Qwen MoE models be supported, preferably via AutoGPTQ or AWQ? #5202

Open wellcasa opened 1 month ago

wellcasa commented 1 month ago

🚀 The feature, motivation and pitch

As the title suggests: vLLM currently supports MoE models, but not their quantized versions. In practice, a quantized version provides better cost-effectiveness.
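For context, here is a sketch of how quantization is already requested for dense (non-MoE) models when launching vLLM's OpenAI-compatible server; the model name below is only a placeholder, and this is exactly what does not yet work for Qwen MoE checkpoints at the time of this issue:

```shell
# Sketch: serving an AWQ-quantized dense model with vLLM's --quantization flag.
# The model name is a placeholder; pointing this at a quantized Qwen MoE
# checkpoint is the unsupported case this issue is requesting.
python -m vllm.entrypoints.openai.api_server \
    --model <org>/<awq-quantized-model> \
    --quantization awq
```

The request is for the same flag (with `awq` or `gptq`) to work when the underlying model uses MoE layers.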

Alternatives

As the title suggests

Additional context

As the title suggests

mgoin commented 1 month ago

Hi @wellcasa, Neural Magic (specifically @ElizaWszola) is working on MoE support for GPTQ models through expansion of the Marlin kernels. It is likely a bit away still, but it is active work!