Closed · Jack47 closed this issue 1 week ago
Describe the bug
Just like this PR: https://github.com/microsoft/DeepSpeed/pull/5259 , the ZeRO optimizer also needs to be fixed in two places (see the sketch after this list):
- the partition logic for expert params
- `average_tensor`, used in the gradient reduction in ZeRO stage 2
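To make the second point concrete, here is a minimal sketch of what the averaging should do, written with plain `torch.distributed` rather than DeepSpeed's actual `average_tensor` code; the helper name and the group handles (`dp_group`, `expert_dp_group`) are hypothetical placeholders:

```python
# Hypothetical sketch (not DeepSpeed's implementation) of correct
# gradient averaging per process group, using plain torch.distributed.
import torch
import torch.distributed as dist

def average_grad(grad: torch.Tensor, group=None) -> torch.Tensor:
    """Sum the gradient across `group`, then divide by that group's size."""
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad.div_(dist.get_world_size(group=group))
    return grad

# Dense (non-expert) params: average over the full data-parallel group.
#   average_grad(dense_param.grad, group=dp_group)
# Expert params: average over the smaller expert data-parallel group
# (world_size // ep ranks). Dividing by the wrong group's size is exactly
# the kind of mismatch that makes expert gradients scale with ep.
#   average_grad(expert_param.grad, group=expert_dp_group)
```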
To Reproduce
Steps to reproduce the behavior: train an LLM with expert parallelism ep=4 and the AdamW optimizer.
Expected behavior
Expert gradients should be identical under ep=4 and ep=1, but currently they come out 4 times larger under ep=4.
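A quick arithmetic check of the reported 4x factor; the world size and gradient values below are assumptions for illustration, only ep=4 and the 4x symptom come from the issue:

```python
# Arithmetic sketch of the symptom. Assumes 8 data-parallel ranks and ep=4,
# so each expert is replicated on world_size // ep = 2 ranks.
world_size = 8
ep = 4
expert_dp_size = world_size // ep  # ranks actually holding a given expert

local_grad = 1.0  # same local gradient on every rank holding the expert

# Correct average over the expert's own data-parallel group:
correct = (local_grad * expert_dp_size) / expert_dp_size  # -> 1.0

# Reported behavior: the result is ep times larger, consistent with a
# divisor that is ep times too small in the ZeRO-2 reduction path.
observed = correct * ep  # -> 4.0
print(correct, observed)
```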
@Jack47 Can you make a PR for this? Thanks!
https://github.com/microsoft/DeepSpeed/pull/5681 has solved this. @Jack47