microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Expert gradient scaling problem with ZeRO optimizer #6545

Open wyooyw opened 1 month ago

wyooyw commented 1 month ago

Describe the bug

When training an MoE model with the ZeRO optimizer, the gradient of the expert weights is ep_size times larger than the true gradient.

Related issue & PR: Issue #5618 already described this bug (the second bug in that issue), but it has been closed, so I am opening a new issue here. PR #5259 fixed the bug in the bf16 optimizer; the ZeRO optimizer also needs to be fixed.

To Reproduce

1. Prepare two models (model1 & model2) with the same input data and initial parameters. Both use the ZeRO-1 (or ZeRO-2) optimizer. Model1 uses ep_size=1, model2 uses ep_size=2.
2. Perform one forward and backward pass on both models.
3. Dump the gradients of the expert weights from both models.
4. The gradients of the expert weights in model2 are ep_size times those of model1.
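A minimal repro sketch of the steps above is given below. The toy model, config values, file names, and sizes are illustrative and not taken from the original report; it assumes a small MoE layer built with `deepspeed.moe.layer.MoE` and is launched twice with the same GPU count, once per ep_size:

```python
# Hypothetical repro sketch (model, config, and dump paths are illustrative).
# Launch twice and compare the dumped expert gradients:
#   deepspeed --num_gpus=2 repro.py --ep-size 1
#   deepspeed --num_gpus=2 repro.py --ep-size 2
import argparse

import torch
import deepspeed
from deepspeed.moe.layer import MoE
from deepspeed.moe.utils import (
    is_moe_param,
    split_params_into_different_moe_groups_for_optimizer,
)
from deepspeed.utils import safe_get_full_grad

parser = argparse.ArgumentParser()
parser.add_argument("--ep-size", type=int, default=1)
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

HIDDEN = 16
deepspeed.init_distributed()


class ToyMoEModel(torch.nn.Module):
    def __init__(self, ep_size: int):
        super().__init__()
        torch.manual_seed(0)  # same initial parameters in both runs
        self.moe = MoE(
            hidden_size=HIDDEN,
            expert=torch.nn.Linear(HIDDEN, HIDDEN),
            num_experts=2,
            ep_size=ep_size,
            k=1,
        )

    def forward(self, x):
        out, _, _ = self.moe(x)
        return out.sum()


# Precision and optimizer settings here are placeholders; the report only
# requires ZeRO stage 1 (or 2).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},
}

model = ToyMoEModel(args.ep_size)
# ZeRO + MoE expects expert parameters to live in their own param group.
param_groups = split_params_into_different_moe_groups_for_optimizer(
    {"params": list(model.parameters()), "name": "parameters"}
)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=param_groups, config=ds_config
)

torch.manual_seed(0)  # same input data in both runs
x = torch.randn(4, HIDDEN, device=engine.device)
loss = engine(x)
engine.backward(loss)

# Dump expert-weight gradients so the two runs can be compared offline.
# Note: with ep_size > 1 each rank holds a different subset of experts,
# so dumps must be matched up by global expert when comparing.
rank = torch.distributed.get_rank()
for name, p in engine.module.named_parameters():
    if is_moe_param(p):
        g = safe_get_full_grad(p)
        if g is not None:
            fname = f"grad_ep{args.ep_size}_rank{rank}_{name.replace('.', '_')}.pt"
            torch.save(g.cpu(), fname)
```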

Expected behavior

The gradients of the expert weights should be the same under different ep_size settings.
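A check of that expectation against the dumps from the sketch above might look like this (the file names follow the illustrative pattern used there):

```python
# Compare a dumped expert-weight gradient from the ep_size=1 and ep_size=2 runs.
import torch

EP_SIZE = 2
NAME = "moe_deepspeed_moe_experts_deepspeed_experts_0_weight"  # illustrative name

g_ep1 = torch.load(f"grad_ep1_rank0_{NAME}.pt")
g_ep2 = torch.load(f"grad_ep{EP_SIZE}_rank0_{NAME}.pt")

# Expected: ratio ~= 1.0.  With the bug present, the ratio comes out ~= EP_SIZE.
print("norm ratio ep2/ep1:", (g_ep2.norm() / g_ep1.norm()).item())
print("gradients match:", torch.allclose(g_ep1, g_ep2, rtol=1e-3))
```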

wyooyw commented 1 month ago

I fixed the bug in PR #6546. The PR has not been merged yet.
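For context, a conceptual workaround is to rescale expert-parameter gradients by 1/ep_size before the optimizer step, so the reduction over the expert-parallel group effectively averages instead of over-counting. The sketch below only illustrates that idea in a plain data-parallel setting; it is not the code from PR #6546, where the fix lives inside the ZeRO optimizer's gradient-averaging path:

```python
# Hypothetical workaround sketch, not the actual fix from PR #6546:
# divide expert-weight gradients by ep_size before optimizer.step().
import torch
from deepspeed.moe.utils import is_moe_param


def rescale_expert_grads(model: torch.nn.Module, ep_size: int) -> None:
    """Scale gradients of expert parameters down by ep_size (illustration only)."""
    if ep_size == 1:
        return
    for p in model.parameters():
        if is_moe_param(p) and p.grad is not None:
            p.grad.div_(ep_size)
```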