Closed · pseudotensor closed this 7 months ago
While I agree something like this is interesting, a model would have to be trained with this feature for vLLM to support it. If a model with such a capability emerges, we could try to implement it.
It might be worthwhile to read the recent DeepMind paper on Mixture of Depths, which covers an adjacent idea: dynamically determining how much compute to spend at runtime.
🚀 The feature, motivation and pitch
Oftentimes we want to balance quality and speed. One could deploy Mixtral and Mistral 7B side by side on separate GPUs and route requests between them, but this is wasteful.
It would be more efficient if Mixtral could be told at request time to route every token through a single 7B expert, ideally matching the cost of a standalone 7B model. If the shared attention layers still make it noticeably more expensive than a plain 7B, then it's probably not worth pursuing.
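To make the idea concrete, here is a minimal sketch (not vLLM's actual implementation or API) of a Mixtral-style sparse MoE block where the number of active experts per token can be overridden per call. The class name `SparseMoEBlock`, the dimensions, and the `top_k` override are all illustrative assumptions; passing `top_k=1` approximates the "single expert" mode described above, though the gate and shared attention layers would still run at full size.

```python
# Hypothetical sketch, not vLLM's API: a Mixtral-style sparse MoE block
# whose number of active experts per token can be lowered at call time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x, top_k=None):
        # Allow overriding the number of experts per token at request time;
        # top_k=1 routes each token through a single expert (cheaper FFN cost).
        k = top_k or self.top_k
        logits = self.gate(x)                        # [tokens, num_experts]
        weights, idx = torch.topk(logits, k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


tokens = torch.randn(4, 1024)
moe = SparseMoEBlock()
fast = moe(tokens, top_k=1)  # single expert per token
full = moe(tokens, top_k=2)  # default Mixtral-style top-2 routing
```

Even in the `top_k=1` case, only the expert FFN cost drops; the router, attention, and embedding layers are unchanged, which is exactly the caveat raised above about whether this can really match a standalone 7B.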
Alternatives
Running 2 models side by side: Mixtral + Mistral 7B.
Additional context
No response