microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

Closed. Jack47 closed this issue 1 week ago

Jack47 commented 3 months ago

Describe the bug
Just like PR https://github.com/microsoft/DeepSpeed/pull/5259, the ZeRO optimizer also needs to be fixed in two places (see the sketch after this list):

  1. The partition logic for expert parameters. (screenshot of the relevant code in the original issue)
  2. average_tensor, used in the gradient reduce in ZeRO stage 2. (screenshot of the relevant code in the original issue)
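For illustration only, here is a minimal arithmetic sketch of why a mismatch in these two places shows up as an ep_size factor. It is not DeepSpeed's actual code; the function and variable names are hypothetical. The only point it makes: if the divisor used when averaging an expert gradient corresponds to the expert data-parallel group (world_size / ep_size ranks) while the ep=1 baseline effectively divides by the full data-parallel world size, the averaged gradient comes out exactly ep_size times larger.

```python
# Minimal arithmetic sketch, NOT DeepSpeed code: names, shapes, and values are made up.
import torch

def average_grad(summed_grad: torch.Tensor, divisor: int) -> torch.Tensor:
    """Average a gradient that has already been summed across a process group."""
    return summed_grad / divisor

world_size, ep_size = 8, 4
expert_dp_size = world_size // ep_size        # ranks that replicate a given expert

# Toy "summed" expert gradient, as it would look after the reduce step.
summed = torch.full((4,), float(world_size))

baseline = average_grad(summed, world_size)        # divisor matching the ep=1 behavior
mismatched = average_grad(summed, expert_dp_size)  # divisor that ignores ep_size

print(torch.allclose(mismatched, baseline * ep_size))  # True: off by exactly ep_size (= 4)
```

Which group size the real partitioning and reduction code should use is exactly what the fixes have to get right; the sketch only shows why a mismatch produces the 4x factor reported below.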

To Reproduce
Steps to reproduce the behavior: train an LLM with expert parallelism ep=4 and the AdamW optimizer.

Expected behavior
Expert gradients should be identical under ep=4 and ep=1, but currently they come out 4 times larger under ep=4 (a hedged repro sketch follows).
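The following is a rough sketch of such a reproduction, not the script used in the report. It assumes a launch with at least 4 GPUs via the deepspeed launcher; the toy model, hidden size, and config values are placeholders, and MoE-specific optimizer parameter grouping is omitted for brevity.

```python
# Hypothetical repro sketch (run with e.g. `deepspeed --num_gpus 4 repro.py`);
# the toy model, sizes, and config values are placeholders, not from the issue.
import torch
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

class ToyExpert(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.fc = nn.Linear(hidden, hidden)

    def forward(self, x):
        return self.fc(x)

class ToyMoEModel(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        # ep_size=4: experts are distributed across expert-parallel groups of 4 ranks.
        self.moe = MoE(hidden_size=hidden, expert=ToyExpert(hidden),
                       num_experts=8, ep_size=4, k=1)

    def forward(self, x):
        out, _, _ = self.moe(self.proj(x))
        return out.mean()

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},   # the ZeRO-2 path whose average_tensor is in question
}

deepspeed.init_distributed()  # set up torch.distributed before building MoE layers
model = ToyMoEModel()
# Note: a real MoE + ZeRO setup typically also needs MoE-specific optimizer
# parameter groups (see deepspeed.moe.utils); omitted here to keep the sketch short.
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

x = torch.randn(2, 64, device=engine.device)
loss = engine(x)
engine.backward(loss)
# To observe the reported behavior, compare expert-parameter gradients from this
# run against an otherwise identical run with ep_size=1; per the report they are
# ep_size (= 4) times larger here before the fix.
```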

jomayeri commented 3 months ago

@Jack47 Can you make a PR for this? Thanks!

ranzhejiang commented 2 weeks ago

https://github.com/microsoft/DeepSpeed/pull/5681 has solved this, @Jack47.