microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Support MoE for GPTModelPipe #373

Closed mosheisland closed 7 months ago

mosheisland commented 7 months ago

Main changes:

NOTE: this PR depends on DeepSpeed PR #5338 (https://github.com/microsoft/DeepSpeed/pull/5338).

Below are TensorBoard captures of the tests used to verify MoE support for the pipeline model and to confirm there are no regressions. Testing was done with the following configurations (a brief topology sketch follows the list):

Training runs with and without this PR:

  1. GPTModel Dense (No MoE) using DP, TP with fp16
  2. GPTModelPipe Dense (No MoE) using DP, TP, PP with BF16_Optimizer
  3. GPTModel MoE using DP, TP, EP with fp16
  4. GPTModelPipe MoE using DP, TP, PP, EP with BF16_Optimizer (only with this PR)
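
As a reference for how these parallel degrees fit together, here is a minimal sketch, assuming a simple DP x TP x PP factorization with expert parallelism carved out of the data-parallel group; the helper name is hypothetical and this is not code from this PR:

```python
def check_topology(world_size: int, dp: int, tp: int, pp: int, ep: int = 1) -> None:
    """Sanity-check a DP/TP/PP/EP layout like the ones used in these tests."""
    # Data, tensor and pipeline parallelism together partition all ranks.
    assert dp * tp * pp == world_size, "DP x TP x PP must equal the world size"
    # Expert parallelism shards the experts across the data-parallel ranks.
    assert dp % ep == 0, "EP must divide DP"

# Example: the 8xA100 GPTModelPipe runs below (DP=2, TP=2, PP=2), with an assumed EP=2.
check_topology(world_size=8, dp=2, tp=2, pp=2, ep=2)
```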

Training loss curve of a GPTModel model with fp16, no MoE (i.e. a dense network). Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel. Comparing without vs. with this PR. [Chart: GPTModel_2D_Dense_fp16_with_vs_without_PR]

Training loss curve of a GPTModel model with fp16, with MoE (4 experts, top-2). Scaling: 8xA100, DP=4, TP=2, PP=1, ZeRO=0, using GPTModel. Comparing without vs. with this PR. [Chart: GPTModel_2D_MoE_fp16_with_vs_without_PR]

Training loss curve of a GPTModelPipe model with BF16_Optimizer, no MoE (i.e. a dense network). Scaling: 8xA100, DP=2, TP=2, PP=2, ZeRO=0, using GPTModelPipe. Comparing without vs. with this PR. [Chart: GPTModelPipe_3D_Dense_bf16_with_vs_without_PR]

Comparing the two MoE configurations (GPTModel with fp16 vs. GPTModelPipe with BF16_Optimizer), both using this PR:

At the beginning of training, GPTModel with fp16 lags slightly behind due to a few steps of loss-scale adjustment. However, both configurations end up with very close loss. [Chart: GPTModel_fp16_vs_GPTModelPipePipe_bf16_MOE]
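
A side note on why the fp16 run starts slightly behind: dynamic loss scaling skips the first few optimizer steps while the scale backs off from its large initial value, whereas bf16 training does not need loss scaling. The sketch below is a generic illustration of that mechanism, not the DeepSpeed implementation:

```python
class DynamicLossScaler:
    """Generic dynamic fp16 loss scaling (illustrative only)."""

    def __init__(self, init_scale: float = 2.0**16, growth_interval: int = 1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if this optimizer step should be skipped."""
        if found_overflow:
            self.scale /= 2.0       # back off after an overflow
            self.good_steps = 0
            return True             # skip the step, which delays early progress
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0       # cautiously grow the scale back
        return False
```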

tohtana commented 7 months ago

Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well. Can you clarify a few things for reference in the future?

mosheisland commented 7 months ago

> Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well. Can you clarify a few things for reference in the future?

> • Are the losses shown here only LM losses or do they include aux losses?

Moshe: For non-MoE runs, the loss shown is the LM loss. For MoE runs, "lm loss" is the LM loss only and "loss" is the LM loss plus the aux loss. However, I see that the chart titles are hard to read.
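
To make the two curve names concrete, here is a hedged sketch of how an MoE run's total loss is typically formed; the variable names and coefficient value are illustrative and not the exact code in this repo:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for tensors produced during one training step.
logits = torch.randn(4, 8, 50257)                          # [batch, seq, vocab]
labels = torch.randint(0, 50257, (4, 8))                   # [batch, seq]
moe_aux_losses = [torch.tensor(0.02), torch.tensor(0.03)]  # one term per MoE layer
moe_loss_coeff = 0.01                                      # aux-loss weight (assumed value)

lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))  # logged as "lm loss"
loss = lm_loss + moe_loss_coeff * sum(moe_aux_losses)                         # logged as "loss"
```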

> • Is it possible to share a chart of aux losses? We would like to make sure that the backprop worked properly.

Moshe: sure, below:

Aux loss: [chart]

LM loss: [chart]

Total loss (LM + aux), displayed only for GPTModelPipe: [chart]

Color legend: with_pr = with both the required DeepSpeed PR and this PR; before = without.

Also, there are two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new; both are the same (the "new" one was run after the last rebase).
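
For context on what the aux-loss charts measure, below is a sketch of the standard top-k load-balancing loss (Switch/GShard style); it is an illustration, not the exact code path exercised by this PR. Because the loss is a differentiable function of the gate probabilities, its gradient flows back into the router, which is what the aux-loss curves help confirm:

```python
import torch

def load_balancing_aux_loss(gate_probs: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    """gate_probs:  [tokens, experts] softmax gate probabilities
    expert_mask: [tokens, experts] one-hot (or k-hot) routing decisions"""
    num_experts = gate_probs.size(-1)
    fraction_routed = expert_mask.float().mean(dim=0)  # f_i: share of tokens sent to each expert
    mean_gate_prob = gate_probs.mean(dim=0)            # P_i: mean gate probability per expert
    return num_experts * torch.sum(fraction_routed * mean_gate_prob)
```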

> • You verified with Z0. Is there any reason you didn't use Z1?

Moshe: I am mainly interested in running with BF16_Optimizer. BF16_Optimizer internally uses ZeRO=1.
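
For readers wondering how the BF16_Optimizer path is selected, below is a minimal DeepSpeed config sketch; the batch-size value is hypothetical and the keys should be checked against the DeepSpeed documentation. As noted above, with bf16 enabled the optimizer already shards its fp32 state across data-parallel ranks in a ZeRO-1-like way:

```python
# Hedged sketch of a DeepSpeed config for the bf16 runs above (assumed values).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # hypothetical value
    "bf16": {"enabled": True},             # engages the BF16_Optimizer path
    "zero_optimization": {"stage": 0},     # matches the ZERO=0 setting used in these tests
}
```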

tohtana commented 7 months ago

@mosheisland Thank you for sharing the results! They all look good to me. Let me take a bit more time to review the PR on DS side.

mosheisland commented 7 months ago

@tohtana, https://github.com/microsoft/DeepSpeed/pull/5338 is now merged, so I think we can proceed with this one.

tohtana commented 7 months ago

Thank you @mosheisland, merged now.