microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

[Question] Comparison to FasterMoE #232

Open Guodanding opened 4 months ago

Guodanding commented 4 months ago

Hello! I am new to MoE, and I am interested in the following question:

What do you think are the differences between Tutel (or Megatron-DeepSpeed, which use dp+tp+ep in the MoE layers) and Fast/FasterMoE? In my opinion, Tutel is better at scalability, as it uses a fixed but searchable parallel solution, while FasterMoE is more elegant and fine-grained but not as scalable, because its fine-grained approach introduces extra communication (the cost of shadowing) and disturbs the ep+tp+dp communication. (I am not sure.) And maybe in a limited-resource situation FasterMoE can do better?

Please correct me if I misunderstand something! :)

ghostplant commented 4 months ago

The main difference is what assumption each is based on.

The assumption of Tutel MoE is no assumption, e.g. it allows switching execution approaches during runtime without influencing the designed accuracy, and with no extra penalty for doing ANY switching.
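For intuition, here is a rough sketch (my own illustration, not Tutel's actual API) of what "switching execution approaches at runtime with no penalty" can look like: several math-equivalent plans for the same computation are timed online, and the fastest one is used next, so the choice never changes the result.

```python
import time
import torch

# Hypothetical illustration, not Tutel's actual API: two math-equivalent plans
# for the same expert computation; the runtime times them and switches freely,
# so the choice never affects the numerical result.
def plan_fused(x, w):
    return x @ w                                               # one fused matmul

def plan_chunked(x, w, chunks=4):
    return torch.cat([part @ w for part in x.chunk(chunks)])   # same math, chunked

PLANS = {"fused": plan_fused, "chunked": plan_chunked}

def run_step(x, w, timings):
    # pick the plan with the lowest running-average latency (untried plans go first)
    name = min(PLANS, key=lambda n: timings.get(n, 0.0))
    start = time.perf_counter()
    out = PLANS[name](x, w)
    timings[name] = 0.9 * timings.get(name, 0.0) + 0.1 * (time.perf_counter() - start)
    return out

timings = {}
x, w = torch.randn(4096, 512), torch.randn(512, 512)
for _ in range(20):
    out = run_step(x, w, timings)
```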

FasterMoE assumes it can afford the expert data migration decision, which may take extra time to complete; thus, when the MoE-based model tends to choose a fixed subset of experts, this migration decision and its cost can pay off.
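The shadowing decision can be framed as a simple cost model. A toy sketch (my own illustration, not FasterMoE's code; `weight_bytes`, `token_bytes`, and `num_nodes` are made-up parameters): replicate ("shadow") an expert when broadcasting its weights once is cheaper than repeatedly sending its tokens over all2all.

```python
# Hypothetical cost model (not FasterMoE's actual code): shadow an expert when
# broadcasting its weights once is cheaper than sending its tokens over all2all.
def experts_to_shadow(tokens_per_expert, weight_bytes, token_bytes, num_nodes):
    shadowed = []
    for expert, n_tokens in tokens_per_expert.items():
        a2a_cost = n_tokens * token_bytes              # cost of moving tokens to the expert
        shadow_cost = (num_nodes - 1) * weight_bytes   # cost of replicating the expert weights
        if shadow_cost < a2a_cost:
            shadowed.append(expert)
    return shadowed

# e.g. a severely imbalanced routing decision with one "hot" expert
load = {0: 500_000, 1: 8_000, 2: 6_000, 3: 5_500}
print(experts_to_shadow(load, weight_bytes=64 << 20, token_bytes=4 << 10, num_nodes=8))
```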

FasterMoE also assumes it can do intra-node all2all occasionally even when cross-node all2all is expected, because that saves inter-node bandwidth and thus gives better throughput. The penalty is that some models may see an accuracy drop, since less inter-node information is exchanged.
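A toy sketch of that idea (again my own illustration, not FasterMoE's actual code; `local_bonus` is just a made-up knob): bias the routing decision toward experts hosted on the same node so fewer tokens have to cross nodes.

```python
import torch

# Hypothetical illustration of topology-aware routing (not FasterMoE's actual code):
# nudge the gate toward experts hosted on the local node, trading some cross-node
# information exchange for less inter-node all2all traffic.
def topology_aware_top1(gate_logits, expert_node, my_node, local_bonus=0.5):
    # gate_logits: [tokens, experts]; expert_node[e] = node hosting expert e
    is_local = (expert_node == my_node).float()        # [experts]
    biased = gate_logits + local_bonus * is_local      # prefer local experts
    return biased.argmax(dim=-1)                       # routing decision only

logits = torch.randn(8, 4)
expert_node = torch.tensor([0, 0, 1, 1])
print(topology_aware_top1(logits, expert_node, my_node=0))
```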

Guodanding commented 4 months ago

> The main difference is what assumption each is based on.
>
> The assumption of Tutel MoE is no assumption, e.g. it allows switching execution approaches during runtime without influencing the designed accuracy, and with no extra penalty for doing ANY switching.
>
> FasterMoE assumes it can afford the expert data migration decision, which may take extra time to complete; thus, when the MoE-based model tends to choose a fixed subset of experts, this migration cost can pay off.
>
> FasterMoE also assumes it can do intra-node all2all occasionally even when cross-node all2all is expected, because that saves inter-node bandwidth and thus gives better throughput. The penalty is that some models may see an accuracy drop.

Does that mean Tutel focuses on the more general situation, while FasterMoE focuses on, and does better in, a special situation?

ghostplant commented 4 months ago

Tutel only integrates math-equivalent optimizations for the standard MoE algorithm, while FasterMoE explores algorithm-wise changes as well as data-wise prediction of expert selection, expecting both ideas together to achieve less end-to-end training time with comparable accuracy. In other words, the gain from Tutel benefits the general situation for sure, while the gain from FasterMoE depends on experimental factors, e.g. predictor accuracy / weight migration penalty / dataset specialty / all2all differences, etc. When these factors work well together, it can be a lot faster than the standard MoE algorithm.
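For reference, the "standard MoE algorithm" both start from is roughly the following top-2 gating (just a rough sketch, not either project's code): math-equivalent optimizations may reorder or re-batch this computation but must keep its outputs unchanged, while algorithm-wise changes alter the routing itself.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of a standard top-2 MoE layer (illustrative only).
def standard_top2_moe(x, gate_w, experts):
    logits = x @ gate_w                                # [tokens, num_experts]
    scores = F.softmax(logits, dim=-1)
    top_val, top_idx = scores.topk(2, dim=-1)          # top-2 routing
    top_val = top_val / top_val.sum(dim=-1, keepdim=True)
    out = torch.zeros_like(x)
    for k in range(2):
        for e, expert in enumerate(experts):
            mask = top_idx[:, k] == e                  # tokens routed to expert e
            if mask.any():
                out[mask] += top_val[mask, k:k+1] * expert(x[mask])
    return out

experts = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(32, 16)
gate_w = torch.randn(16, 4)
print(standard_top2_moe(x, gate_w, experts).shape)
```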

Guodanding commented 4 months ago

> In other words, the gain from Tutel benefits the general situation for sure, while the gain from FasterMoE depends on experimental factors, e.g. predictor accuracy / weight migration penalty / dataset specialty / all2all differences, etc. When these factors work well together, it can be a lot faster than the standard MoE algorithm.

I get the point. Thanks!

> while FasterMoE explores algorithm-wise changes as well as data-wise prediction of expert selection

Do the algorithm-wise changes mean the topology-aware gate, and does the data-wise prediction mean shadow experts? If so, the shadow policy is decided after gating, so maybe it is not really a prediction.

By the way, since Tutel and FasterMoE (along with others like SmartMoE, MegaBlocks, Janus) emerged in 2022-2023, are there any newer state-of-the-art frameworks designed to accelerate MoE training? What are the remaining challenges in MoE training now? What type of framework is preferred in industry for training MoE?

Thanks :)!