Describe the bug
When training a MoE model with the ZeRO optimizer, the gradient of the expert weights is ep_size times larger than the true gradient.
Related issue & PR
Issue [#5618] described this bug (the second of the two bugs reported there), but that issue has been closed, so I am opening a new one here.
PR [#5259] fixed the bug in the BF16 optimizer; the ZeRO optimizer needs the same fix.
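If the ZeRO fix mirrors the BF16 one, the correction presumably amounts to rescaling the reduced expert gradients by 1/ep_size. As a stopgap, that rescaling can be applied by hand after backward; below is a minimal sketch, assuming DeepSpeed's convention of marking expert parameters with `allreduce = False` (the check used by `deepspeed.moe.utils.is_moe_param`), not the actual PR's implementation:

```python
import torch

def rescale_expert_grads(model: torch.nn.Module, ep_size: int) -> None:
    """Workaround sketch: divide expert-weight gradients by ep_size so
    they match the ep_size=1 result. Assumes expert parameters carry
    DeepSpeed's `allreduce = False` marker."""
    if ep_size == 1:
        return  # nothing to correct without expert parallelism
    for p in model.parameters():
        if getattr(p, "allreduce", True) is False and p.grad is not None:
            p.grad.div_(ep_size)
```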
To Reproduce
1. Prepare two models (model1 and model2) with the same input data and initial parameters, both using the ZeRO-1 (or ZeRO-2) optimizer. model1 uses ep_size=1; model2 uses ep_size=2.
2. Run one forward and one backward pass on both models.
3. Dump the gradients of the expert weights from both models (a condensed script follows this list).
4. Observe that the expert-weight gradients in model2 are ep_size times those in model1.
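Condensed into code, the four steps might look like the following sketch, built on a toy `deepspeed.moe.layer.MoE` model. The config, shapes, and dump format are illustrative; this is a sketch of the procedure rather than a verified reproducer.

```python
# repro.py -- run under the deepspeed launcher
# (e.g. `deepspeed --num_gpus=2 repro.py`), once with EP_SIZE = 1
# (model1) and once with EP_SIZE = 2 (model2).
import torch
import deepspeed
from deepspeed.moe.layer import MoE

EP_SIZE = 1  # 1 for model1, 2 for model2
HIDDEN = 16

deepspeed.init_distributed()  # MoE needs process groups at construction

class ToyMoE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        torch.manual_seed(0)  # step 1: identical initial parameters
        self.moe = MoE(hidden_size=HIDDEN,
                       expert=torch.nn.Linear(HIDDEN, HIDDEN),
                       num_experts=2,
                       ep_size=EP_SIZE)

    def forward(self, x):
        out, _, _ = self.moe(x)  # MoE returns (output, l_aux, exp_counts)
        return out.float().sum()

model = ToyMoE()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    # real ZeRO+MoE runs group expert params separately, e.g. via
    # deepspeed.moe.utils.split_params_into_different_moe_groups_for_optimizer
    model_parameters=model.parameters(),
    config={
        "train_batch_size": 2,
        "zero_optimization": {"stage": 1},  # or 2
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    })

torch.manual_seed(0)  # step 1: identical input data
x = torch.randn(2, 4, HIDDEN, device=engine.device)
loss = engine(x)       # step 2: forward
engine.backward(loss)  # step 2: backward (gradients reduced here)

# Step 3: dump the expert-weight gradients. With ZeRO-1 the reduced
# gradient is still visible on p.grad; stage 2 may require reading the
# optimizer's partitioned gradient buffers instead.
grads = {n: p.grad.detach().clone().cpu()
         for n, p in engine.module.named_parameters()
         if "expert" in n and p.grad is not None}
torch.save(grads, f"expert_grads_ep{EP_SIZE}_rank{engine.global_rank}.pt")
```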
Expected behavior
The expert-weight gradients should be the same for any ep_size.
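Using the dumps written by the repro sketch above (file names follow its placeholder naming; matching local expert indices across the two runs is glossed over by intersecting parameter names):

```python
import torch

# With the bug present the ratio comes out ~ep_size (2.0 here) instead
# of the expected ~1.0.
g1 = torch.load("expert_grads_ep1_rank0.pt")  # placeholder path
g2 = torch.load("expert_grads_ep2_rank0.pt")  # placeholder path
for name in sorted(g1.keys() & g2.keys()):
    ratio = (g2[name].norm() / g1[name].norm()).item()
    print(f"{name}: grad-norm ratio = {ratio:.3f}")
```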