laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai
Apache License 2.0

Adding Expert Prototyping to FastMoE #69

Open JustinLin610 opened 3 years ago

JustinLin610 commented 3 years ago

Hi, thanks for providing an end-to-end training framework in PyTorch for MoE models. We recently implemented MoE in TensorFlow and found that categorizing experts into different groups can improve model quality. More details can be found in our paper https://arxiv.org/abs/2105.15082. I wonder if it is possible to add this feature, as FastMoE really facilitates research on sparse expert models.

Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods like Switch or top-2 routing, since you can simply set the group number to 1. We find that increasing the value of k in top-k routing can improve model performance, and k top-1 (k groups, each routing with top-1) achieves a similar effect. It is also possible to try out more complex strategies, say k top-k' or so.
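To make the idea concrete, here is a minimal standalone PyTorch sketch of k top-1 grouped routing. It is illustrative only, not FastMoE's API and not the snippet from our paper; all class, argument, and dimension names below are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedTop1MoE(nn.Module):
    """k top-1 routing: experts are split into k groups, each group has its
    own gate, and every group sends each token to its top-1 expert."""

    def __init__(self, d_model, d_hidden, num_groups, experts_per_group):
        super().__init__()
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        # One gating network per group, scoring only that group's experts.
        self.gates = nn.ModuleList(
            nn.Linear(d_model, experts_per_group) for _ in range(num_groups)
        )
        # num_groups * experts_per_group feed-forward experts in total.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_groups * experts_per_group)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            scores = F.softmax(self.gates[g](x), dim=-1)
            weight, idx = scores.max(dim=-1)  # top-1 inside this group
            for e in range(self.experts_per_group):
                mask = idx == e
                if mask.any():
                    expert = self.experts[g * self.experts_per_group + e]
                    out[mask] += weight[mask, None] * expert(x[mask])
        # Summing over the k groups means k experts are active per token.
        return out


# Example: 4 groups of 8 experts, i.e. "4 top-1" over 32 experts in total.
layer = GroupedTop1MoE(d_model=512, d_hidden=2048,
                       num_groups=4, experts_per_group=8)
```

Setting num_groups=1 recovers plain top-1 (Switch-style) routing, which is what we mean by the strategy being compatible with conventional routing.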

We have a code snippet in the appendix of the paper, which may be helpful.

xptree commented 3 years ago

Here is another recent work on MoE.

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning https://arxiv.org/abs/2106.03760

The idea is to activate all experts at the beginning of training, but quickly converge to sparse activation. I wonder whether such a mechanism can help train better pre-trained models when our expert pool is not that large.
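As a rough illustration of that dense-to-sparse trajectory (this is not DSelect-k itself, which uses a learned smooth binary encoding; it is just a toy temperature-annealed softmax gate, and all names and hyperparameters below are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnnealedSoftmaxGate(nn.Module):
    """Softmax gate with temperature annealing: starts nearly dense
    (high temperature, near-uniform weights over all experts) and
    gradually becomes sparse (low temperature, near one-hot weights)."""

    def __init__(self, d_model, num_expert,
                 t_start=5.0, t_end=0.1, anneal_steps=10_000):
        super().__init__()
        self.proj = nn.Linear(d_model, num_expert)
        self.t_start, self.t_end = t_start, t_end
        self.anneal_steps = anneal_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def temperature(self):
        frac = min(self.step.item() / self.anneal_steps, 1.0)
        return self.t_start + frac * (self.t_end - self.t_start)

    def forward(self, x):  # x: (num_tokens, d_model)
        if self.training:
            self.step += 1
        # Dividing logits by a shrinking temperature sharpens the
        # distribution over experts as training progresses.
        return F.softmax(self.proj(x) / self.temperature(), dim=-1)
```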

Let me know what you think about it.