Closed · pseudotensor closed this 7 months ago
While I agree something like this is interesting, a model would have to be trained with this feature for vLLM to support it. If a model with such a capability emerges, we could try to implement it.
It might be worthwhile to read the recent DeepMind paper on Mixture of Depths, which covers an adjacent idea: dynamically determining how much compute to spend at runtime.
🚀 The feature, motivation and pitch
Oftentimes we want to balance quality and speed. One could deploy Mixtral and Mistral 7B side by side on separate GPUs and route requests between them, but this is wasteful.
It would be more efficient if Mixtral could be told at request time to route every token through a single 7B expert, ideally matching the cost of a standalone 7B model. If the shared attention layers still make it noticeably more expensive than a plain 7B, then it's probably not worth pursuing.
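To make the idea concrete, here is a minimal sketch (not vLLM's actual implementation or API) of a Mixtral-style sparse MoE block where the number of active experts per token can be overridden per call. The class name `SparseMoEBlock`, the dimensions, and the `top_k` override are all illustrative assumptions; passing `top_k=1` approximates the "single expert" mode described above, though the gate and shared attention layers would still run at full size.

```python
# Hypothetical sketch, not vLLM's API: a Mixtral-style sparse MoE block
# whose number of active experts per token can be lowered at call time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x, top_k=None):
        # Allow overriding the number of experts per token at request time;
        # top_k=1 routes each token through a single expert (cheaper FFN cost).
        k = top_k or self.top_k
        logits = self.gate(x)                        # [tokens, num_experts]
        weights, idx = torch.topk(logits, k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


tokens = torch.randn(4, 1024)
moe = SparseMoEBlock()
fast = moe(tokens, top_k=1)  # single expert per token
full = moe(tokens, top_k=2)  # default Mixtral-style top-2 routing
```

Even in the `top_k=1` case, only the expert FFN cost drops; the router, attention, and embedding layers are unchanged, which is exactly the caveat raised above about whether this can really match a standalone 7B.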
Alternatives
Running 2 models side by side: Mixtral + Mistral 7B.
Additional context
No response