Thank you @mosheisland, this work is truly amazing! I deeply appreciate your effort for the thorough verification as well. Can you clarify a few things for reference in the future?
- Are the losses shown here only LM losses or do they include aux losses?
Moshe: For non-MoE runs, the losses are the LM loss only. For MoE runs, "lm loss" is the LM loss alone and "loss" is LM loss + aux loss. However, I see that the chart titles are hard to read.
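For reference, a minimal sketch (not the actual Megatron-DeepSpeed code; the function and argument names are illustrative only) of how the two logged quantities relate for an MoE run:

```python
# Illustrative sketch of the two reported metrics for an MoE run:
# "lm loss" is the language-model loss alone, "loss" adds the MoE
# load-balancing auxiliary loss. Names and coefficient are assumptions.

def combine_losses(lm_loss, moe_aux_losses, aux_loss_coeff=0.01):
    """Return (lm_loss, total_loss) as they appear in the charts.

    lm_loss        -- cross-entropy language-model loss for the batch
    moe_aux_losses -- per-MoE-layer load-balancing losses collected
                      during the forward pass
    aux_loss_coeff -- illustrative weighting; the real value comes from
                      the training arguments
    """
    aux_loss = aux_loss_coeff * sum(moe_aux_losses)
    total_loss = lm_loss + aux_loss  # backprop runs on this sum
    return lm_loss, total_loss
```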
- Is it possible to share a chart of aux losses? We would like to make sure that the backprop worked properly.
Moshe: Sure, see below:
Aux loss [tensorboard chart]
LM loss [tensorboard chart]
Total loss (LM + Aux) [tensorboard chart]: displayed only for GPTModelPipe
Color legend: with_pr = with both the required DeepSpeed PR and this PR; before = without them.
Also, there are two runs called gpt_3d_moe_with_pr and gpt_3d_moe_with_pr_new; both are the same run (the "new" one is from after the last rebase).
- You verified with ZeRO-0. Is there a reason you didn't use ZeRO-1?
Moshe: I am mainly interested in running with BF16_Optimizer, and BF16_Optimizer internally uses ZeRO-1.
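For context, a hedged sketch of a DeepSpeed config that exercises this path. The "bf16" and "zero_optimization" keys are standard DeepSpeed config fields; the exact combination used in these runs is an assumption based on the comment above (BF16_Optimizer selected with ZeRO stage 0 in the config, while the optimizer shards state ZeRO-1-style internally):

```python
# Hedged sketch of a DeepSpeed config dict for the BF16_Optimizer path.
# Values are illustrative, not the configuration actually used in the runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # illustrative value
    "bf16": {"enabled": True},             # engages the BF16_Optimizer path
    "zero_optimization": {"stage": 0},     # matches the ZeRO=0 runs above
}
```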
@mosheisland Thank you for sharing the results! They all look good to me. Let me take a bit more time to review the PR on the DeepSpeed side.
@tohtana, https://github.com/microsoft/DeepSpeed/pull/5338 is merged, so I think we can now proceed with this one.
Thank you @mosheisland, merged now.
Main changes:
NOTE: this PR depends on DeepSpeed PR #5338 (https://github.com/microsoft/DeepSpeed/pull/5338).
Below are tensorboard captures of tests that verify MoE support for pipeline parallelism and check for regressions. Testing was done with the following configurations:
Training runs with and without this PR:
Training loss curve of a GPTModel with fp16, no MoE (i.e. a dense network). Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0. Comparing without vs. with this PR.
Training loss curve of a GPTModel with fp16, with MoE (4 experts, top-2). Scaling: 8xA100, DP=4 TP=2 PP=1, ZeRO=0. Comparing without vs. with this PR.
Training loss curve of a GPTModelPipe with BF16_Optimizer, no MoE (i.e. a dense network). Scaling: 8xA100, DP=2 TP=2 PP=2, ZeRO=0. Comparing without vs. with this PR.
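As a quick sanity check of the topologies in the captions above (the script is illustrative only; the values are taken from the captions), the parallel degrees in each run multiply to the 8 GPUs used:

```python
# Sanity check: data-, tensor- and pipeline-parallel degrees multiply to
# the 8 A100s used in every run listed above.
configs = [
    ("GPTModel fp16, dense",        4, 2, 1),
    ("GPTModel fp16, MoE 4-expert", 4, 2, 1),
    ("GPTModelPipe BF16, dense",    2, 2, 2),
]
for name, dp, tp, pp in configs:
    assert dp * tp * pp == 8, name
    print(f"{name}: DP={dp} x TP={tp} x PP={pp} = {dp * tp * pp} GPUs")
```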
Comparison of runs, both using this PR:
At the beginning of training, GPTModel with fp16 lags slightly due to a few steps of loss-scale adjustments. However, both configurations end up with a very close loss.
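For context, a minimal sketch of dynamic loss scaling (not the actual DeepSpeed/Megatron implementation; class and parameter names are illustrative) that explains the brief early lag of the fp16 run:

```python
# Minimal sketch of dynamic loss scaling: when fp16 gradients overflow,
# the step is skipped and the scale is reduced, which is why the fp16 run
# lags for the first few steps before catching up.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, growth_interval=1000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            self.scale /= 2.0        # back off and skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0        # cautiously grow the scale again
        return True
```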