Hi @wellcasa, Neural Magic (specifically @ElizaWszola) is working on MoE support for GPTQ models through expansion of the Marlin kernels. It is likely a bit away still, but it is active work!
🚀 The feature, motivation and pitch
As the title suggests: vLLM currently supports MoE models, but it does not support quantized versions of them. In practice, a quantized version offers much better cost-effectiveness.
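To make the cost-effectiveness point concrete, here is a back-of-the-envelope memory estimate for a popular MoE model. The parameter count (~46.7B for Mixtral 8x7B) and the assumption that 4-bit GPTQ overhead (scales/zero-points) is negligible are rough assumptions for illustration, not vLLM measurements:

```python
# Rough weight-memory estimate showing why quantized MoE support matters.
# Numbers are illustrative assumptions, not measured vLLM figures.

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for n_params parameters."""
    return n_params * bits_per_weight / 8 / 2**30

MIXTRAL_PARAMS = 46.7e9  # approx. total parameters of Mixtral 8x7B (assumption)

fp16 = weight_gib(MIXTRAL_PARAMS, 16)
gptq4 = weight_gib(MIXTRAL_PARAMS, 4)  # ignores GPTQ scale/zero-point overhead

print(f"FP16 weights:  ~{fp16:.0f} GiB")   # roughly 87 GiB
print(f"4-bit weights: ~{gptq4:.0f} GiB")  # roughly 22 GiB
```

Under these assumptions, 4-bit quantization cuts the weight footprint from multiple 80 GB GPUs down to a single one, which is the cost-effectiveness gap this request is about.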