clumsy opened this issue 1 year ago
The authors claim a 2x convergence speedup with expert choice (EC) routing: https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html
I hope this incentivizes implementing it in DeepSpeed.
Thank you @clumsy for sharing this paper.
@ykim362, have you seen this paper? Is anyone in your team or any interns interested in implementing this feature?
Hi @awan-10. I have an implementation of this paper, but we didn't see the gains mentioned in it. In fact, the accuracy was noticeably worse than with the original top-1 and top-2 gating.
@clumsy have you actually done any experiments with this expert choice gating?
No @ykim362, but I would like to experiment with it and share the results. Is it possible to share the snippet with the implementation you used?
@clumsy you can take a look at this experimental branch. https://github.com/ykim362/DeepSpeed/tree/youki/expc
Hey, Google has an implementation of expert choice routing here: https://github.com/google/flaxformer/blob/main/flaxformer/architectures/moe/routing.py#L647-L717
They have a note that it should not be used in decoder blocks; maybe that was the reason for the poor results in your experiments?
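A tiny sketch of why that caveat matters (the function name and scores below are made up for illustration, not taken from flaxformer or DeepSpeed): under expert choice, an expert selects its top-c tokens over the whole sequence, so whether an earlier token gets routed to an expert can depend on tokens that come after it, which is non-causal and therefore problematic for autoregressive decoders.

```python
import numpy as np

def topc_tokens(scores, capacity):
    # In expert choice routing, each expert picks its top-`capacity`
    # tokens across the entire sequence at once.
    return set(np.argsort(-scores)[:capacity].tolist())

# One expert's affinity for tokens 0, 1, 2 (hypothetical numbers).
scores = np.array([0.5, 0.4, 0.9])
future = scores.copy()
future[2] = 0.1  # change only the score of the *last* token

# Token 1's routing flips based on a token that comes after it,
# so the selection depends on future positions.
assert topc_tokens(scores, 2) == {0, 2}
assert topc_tokens(future, 2) == {0, 1}
```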
Is your feature request related to a problem? Please describe. A paper was published describing a potentially better token-expert routing scheme for MoE that leaves fewer experts under-trained.
Describe the solution you'd like In addition to GShard's top-2 and Switch Transformer's top-1 per-token expert routing, add an expert choice routing option.
Describe alternatives you've considered N/A
Additional context N/A
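For reference, the core of expert choice gating can be sketched in a few lines of NumPy. This is a hypothetical illustration of the idea, not DeepSpeed's or flaxformer's API; `expert_choice_routing` and its signature are made up. The key inversion: instead of each token picking its top-k experts, each expert picks its top-`capacity` tokens, so every expert processes exactly the same number of tokens and none is starved.

```python
import numpy as np

def expert_choice_routing(logits, capacity):
    """logits: [n_tokens, n_experts] router scores.
    Returns (indices, gates): for each expert, the ids of the tokens it
    selected and the softmax gate weights applied to them."""
    # Softmax over experts for each token.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Transpose so each row is one expert's affinity for all tokens.
    scores = probs.T                                       # [n_experts, n_tokens]
    # Each expert selects its top-`capacity` tokens by affinity.
    indices = np.argsort(-scores, axis=-1)[:, :capacity]   # [n_experts, capacity]
    gates = np.take_along_axis(scores, indices, axis=-1)   # [n_experts, capacity]
    return indices, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 2))   # 8 tokens, 2 experts (toy sizes)
idx, gates = expert_choice_routing(logits, capacity=4)
# Every expert processes exactly `capacity` tokens: perfect load balance.
assert idx.shape == (2, 4)
```

Note this is also where the decoder caveat comes from: the `argsort` runs over the full token axis, so routing for early tokens depends on later ones.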