vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: MoE (e.g. Mixtral) dynamically choose number of experts at runtime #3838

Closed pseudotensor closed 7 months ago

pseudotensor commented 7 months ago

🚀 The feature, motivation and pitch

Oftentimes we want to balance output quality and speed. One could deploy both Mixtral and Mistral 7B, each on its own GPUs, but this is wasteful.

It would be more efficient if Mixtral could be told at runtime to route each token to a single 7B expert and, ideally, run about as fast as a standalone 7B model. If the shared attention layers still make it noticeably slower than a plain 7B, then it is probably not worth doing.
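To make the idea concrete, here is a minimal sketch (not vLLM's actual MoE implementation) of top-k expert routing where the number of active experts per token is a runtime argument rather than a fixed model constant. The function name `route_tokens` and its signature are illustrative assumptions; note that stock Mixtral was trained with top-2 routing, so output quality with top-1 is an open question.

```python
# Illustrative sketch only, assuming a Mixtral-style router; not vLLM code.
import torch
import torch.nn.functional as F


def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor, top_k: int):
    """Select `top_k` experts per token from the router logits.

    hidden:      [num_tokens, hidden_dim]
    gate_weight: [num_experts, hidden_dim] (router / gating projection)
    top_k:       number of experts activated per token (2 for stock Mixtral;
                 1 would approximate the "single 7B expert" mode proposed above)
    """
    router_logits = hidden @ gate_weight.t()            # [num_tokens, num_experts]
    topk_logits, topk_ids = torch.topk(router_logits, top_k, dim=-1)
    # Renormalize so the selected experts' weights sum to 1 per token.
    topk_weights = F.softmax(topk_logits, dim=-1)
    return topk_weights, topk_ids


if __name__ == "__main__":
    torch.manual_seed(0)
    hidden = torch.randn(4, 32)    # 4 tokens, hidden size 32 (toy dimensions)
    gate = torch.randn(8, 32)      # 8 experts, as in Mixtral
    w2, ids2 = route_tokens(hidden, gate, top_k=2)   # default Mixtral routing
    w1, ids1 = route_tokens(hidden, gate, top_k=1)   # hypothetical single-expert mode
    print(ids2, ids1, sep="\n")
```

With `top_k=1`, each token passes through only one expert MLP, so the MoE layers cost roughly what a dense 7B MLP would; the attention layers and weight memory footprint are unchanged, which is the caveat raised above.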

Alternatives

2 models: Mixtral + Mistral 7B

Additional context

No response

robertgshaw2-neuralmagic commented 7 months ago

While I agree something like this is interesting, a model would have to be trained with this feature for vLLM to support it. If a model emerges with a feature like this, we could try to implement it.

It might be worthwhile to read the recent paper from DeepMind on Mixture of Depths, which covers an adjacent idea of dynamically determining the amount of compute at runtime.