Closed · Jack47 closed this issue 1 week ago
Describe the bug
Just like this PR: https://github.com/microsoft/DeepSpeed/pull/5259 , the ZeRO optimizer also needs to be fixed in two places (see the sketch after this list):
- the partition logic for expert params
- `average_tensor`, used in the gradient reduction in ZeRO stage 2
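To make the second point concrete, here is a minimal sketch of what the averaging should do, written with plain `torch.distributed` rather than DeepSpeed's actual `average_tensor` code; the helper name and the group handles (`dp_group`, `expert_dp_group`) are hypothetical placeholders:

```python
# Hypothetical sketch (not DeepSpeed's implementation) of correct
# gradient averaging per process group, using plain torch.distributed.
import torch
import torch.distributed as dist

def average_grad(grad: torch.Tensor, group=None) -> torch.Tensor:
    """Sum the gradient across `group`, then divide by that group's size."""
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad.div_(dist.get_world_size(group=group))
    return grad

# Dense (non-expert) params: average over the full data-parallel group.
#   average_grad(dense_param.grad, group=dp_group)
# Expert params: average over the smaller expert data-parallel group
# (world_size // ep ranks). Dividing by the wrong group's size is exactly
# the kind of mismatch that makes expert gradients scale with ep.
#   average_grad(expert_param.grad, group=expert_dp_group)
```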
To Reproduce
Steps to reproduce the behavior: train an LLM with expert parallelism ep=4 and the AdamW optimizer.
Expected behavior
Expert gradients should be identical under ep=4 and ep=1, but currently they come out 4 times larger under ep=4.
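A quick arithmetic check of the reported 4x factor; the world size and gradient values below are assumptions for illustration, only ep=4 and the 4x symptom come from the issue:

```python
# Arithmetic sketch of the symptom. Assumes 8 data-parallel ranks and ep=4,
# so each expert is replicated on world_size // ep = 2 ranks.
world_size = 8
ep = 4
expert_dp_size = world_size // ep  # ranks actually holding a given expert

local_grad = 1.0  # same local gradient on every rank holding the expert

# Correct average over the expert's own data-parallel group:
correct = (local_grad * expert_dp_size) / expert_dp_size  # -> 1.0

# Reported behavior: the result is ep times larger, consistent with a
# divisor that is ep times too small in the ZeRO-2 reduction path.
observed = correct * ep  # -> 4.0
print(correct, observed)
```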
@Jack47 Can you make a PR for this? Thanks!
https://github.com/microsoft/DeepSpeed/pull/5681 has solved this. @Jack47