Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well. Can you clarify a few things for reference in the future?
- Are the losses shown here only LM losses or do they include aux losses?
Moshe: For non-MoE runs, the losses are the LM loss only. For MoE runs, "lm loss" is the LM loss alone and "loss" is LM loss + aux loss. However, I see that the chart titles are hard to read.
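For reference, a minimal sketch (not the actual Megatron-DeepSpeed code; the function and argument names are illustrative only) of how the two logged quantities relate for an MoE run:

```python
# Illustrative sketch of the two reported metrics for an MoE run:
# "lm loss" is the language-model loss alone, "loss" adds the MoE
# load-balancing auxiliary loss. Names and coefficient are assumptions.

def combine_losses(lm_loss, moe_aux_losses, aux_loss_coeff=0.01):
    """Return (lm_loss, total_loss) as they appear in the charts.

    lm_loss        -- cross-entropy language-model loss for the batch
    moe_aux_losses -- per-MoE-layer load-balancing losses collected
                      during the forward pass
    aux_loss_coeff -- illustrative weighting; the real value comes from
                      the training arguments
    """
    aux_loss = aux_loss_coeff * sum(moe_aux_losses)
    total_loss = lm_loss + aux_loss  # backprop runs on this sum
    return lm_loss, total_loss
```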
- Is it possible to share a chart of aux losses? We would like to make sure that the backprop worked properly.
Moshe: Sure, see below:
Aux loss [tensorboard chart]
LM loss [tensorboard chart]
Total loss (LM + Aux) [tensorboard chart]: displayed only for GPTModelPipe
Color legend: with_pr = with both the required DeepSpeed PR and this PR; before = without them.
Also, there are two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new; both are the same run (the "new" one is from after the last rebase).
- You verified with ZeRO-0. Is there a reason you didn't use ZeRO-1?
Moshe: I am mainly interested in running with BF16_Optimizer, and BF16_Optimizer internally uses ZeRO-1.
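For context, a hedged sketch of a DeepSpeed config that exercises this path. The "bf16" and "zero_optimization" keys are standard DeepSpeed config fields; the exact combination used in these runs is an assumption based on the comment above (BF16_Optimizer selected with ZeRO stage 0 in the config, while the optimizer shards state ZeRO-1-style internally):

```python
# Hedged sketch of a DeepSpeed config dict for the BF16_Optimizer path.
# Values are illustrative, not the configuration actually used in the runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # illustrative value
    "bf16": {"enabled": True},             # engages the BF16_Optimizer path
    "zero_optimization": {"stage": 0},     # matches the ZeRO=0 runs above
}
```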
@mosheisland Thank you for sharing the results! They all look good to me. Let me take a bit more time to review the PR on the DeepSpeed side.
@tohtana, https://github.com/microsoft/DeepSpeed/pull/5338 is merged, so I think we can now proceed with this one.
Thank you @mosheisland, merged now.
Main changes:
NOTE: this PR depends on DeepSpeed PR #5338 (https://github.com/microsoft/DeepSpeed/pull/5338).
Below are tensorboard captures of tests that verify MoE support for pipeline parallelism and check for regressions. Testing was done with the following configurations:
Training runs with and without this PR:
Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network). Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0. Comparing without vs. with this PR.
Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2). Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0. Comparing without vs. with this PR.
Training loss curve of a GPTModelPipe with BF16_Optimizer, no MoE (i.e. a dense network). Scaling: 8xA100, DP=2 TP=2 PP=2, ZeRO=0. Comparing without vs. with this PR.
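As a quick sanity check of the topologies in the captions above (the script is illustrative only; the values are taken from the captions), the parallel degrees in each run multiply to the 8 GPUs used:

```python
# Sanity check: data-, tensor- and pipeline-parallel degrees multiply to
# the 8 A100s used in every run listed above.
configs = [
    ("GPTModel fp16, dense",        4, 2, 1),
    ("GPTModel fp16, MoE 4-expert", 4, 2, 1),
    ("GPTModelPipe BF16, dense",    2, 2, 2),
]
for name, dp, tp, pp in configs:
    assert dp * tp * pp == 8, name
    print(f"{name}: DP={dp} x TP={tp} x PP={pp} = {dp * tp * pp} GPUs")
```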
Comparison of runs, both using this PR:
At the beginning of training, GPTModel with fp16 lags slightly due to a few steps of loss-scale adjustments. However, both configurations end up with a very close loss.
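For context, a minimal sketch of dynamic loss scaling (not the actual DeepSpeed/Megatron implementation; class and parameter names are illustrative) that explains the brief early lag of the fp16 run:

```python
# Minimal sketch of dynamic loss scaling: when fp16 gradients overflow,
# the step is skipped and the scale is reduced, which is why the fp16 run
# lags for the first few steps before catching up.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            self.scale /= 2.0        # back off and skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0        # cautiously grow the scale again
        return True
```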