Open KyleMylonakisProtopia opened 8 months ago
Hi @KyleMylonakisProtopia, please give this PR a try; hopefully it resolves the issue. Best, Reza
That PR seems to resolve the issue. Thanks for looking at it!
@tjruwase, let's please close this and merge the PR :)
Describe the bug
When performing a training run with a model containing Mixture of Experts (MoE) layers, using ZeRO stage 2 offload with the `DeepSpeedCPUAdam` optimizer, the following runtime error is thrown during the parameter update step.

When using `ep_size=1` for the expert layers, the call to `self._average_expert_grad_norms(norm_groups)` is not necessary, and commenting it out resolves the issue. This is of course not a general solution for `ep_size > 1`, but in my case it is sufficient to let me continue my work.

To Reproduce
Steps to reproduce the behavior:
- Train a model with MoE layers using ZeRO stage 2 offload and the `DeepSpeedCPUAdam` optimizer for efficient CPU offload (a minimal sketch of this setup follows the Expected behavior section below).

Expected behavior
Model training should proceed with no issues or errors thrown.
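For reference, here is a minimal sketch of the kind of setup that triggers this for me. The model is a toy stand-in: `hidden_size`, the expert module, and the hyperparameters are placeholders, and the `MoE` constructor arguments follow my understanding of `deepspeed.moe.layer.MoE`, so treat this as illustrative rather than an exact repro.

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 256  # placeholder size; the real model is larger

# Toy expert: any feed-forward block works here.
expert = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.ReLU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

class ToyMoEModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        # ep_size=1: a single expert-parallel group, as in my runs.
        self.moe = MoE(
            hidden_size=hidden_size,
            expert=expert,
            num_experts=4,
            ep_size=1,
            k=1,
        )
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = self.proj(x)
        x, _, _ = self.moe(x)  # MoE returns (output, l_aux, exp_counts)
        return self.head(x)

# ZeRO stage 2 with the optimizer state offloaded to CPU; with offload
# enabled, DeepSpeed uses DeepSpeedCPUAdam for the parameter update.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}
```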
ds_report output
Screenshots N/A
System info (please complete the following information):
Launcher context PyTorch Lightning
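A hedged sketch of how I launch this through Lightning, assuming the `DeepSpeedStrategy` API; the `LightningModule` that wraps the MoE model is omitted, so `lightning_module` below is a placeholder.

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

# Mirrors the DeepSpeed config above: ZeRO stage 2 with the optimizer
# state offloaded to CPU (which selects DeepSpeedCPUAdam).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    strategy=DeepSpeedStrategy(stage=2, offload_optimizer=True),
)
# trainer.fit(lightning_module)  # lightning_module wraps the MoE model above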
Docker context Bare metal.
Additional context
I have `ep_size=1` for my Mixture of Experts layers, so this bug is entirely avoidable for me by simply not performing the all-reduce step.
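To make the workaround concrete, this is roughly what I am running with locally: a no-op for the expert grad-norm averaging, which is only valid because `ep_size=1` means there is a single expert-parallel rank and nothing to average. The module/class path is an assumption tied to my DeepSpeed version, and this is not a general fix for `ep_size > 1`.

```python
# Sketch of my local workaround (not a general fix): with ep_size=1 the
# expert grad-norm all-reduce is a no-op, so skip it entirely.
# The import path below is an assumption based on my DeepSpeed version;
# newer versions may place the stage 2 optimizer elsewhere.
from deepspeed.runtime.zero.stage2 import FP16_DeepSpeedZeroOptimizer

def _skip_expert_grad_norm_average(self, norm_groups):
    # With a single expert-parallel rank there is nothing to average.
    return

FP16_DeepSpeedZeroOptimizer._average_expert_grad_norms = _skip_expert_grad_norm_average
```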