[Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3.

microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Other

1.89k stars 344 forks source link

[Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3. #428

Closed ys950902 closed 1 month ago

ys950902 commented 3 months ago

When running the Megatron-DeepSpeed with DeepSpeed on zero3, the grad_weight is set to None by default, that will cause the error issue below: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1253 AttributeError: 'NoneType' object has no attribute 'numel' This pr is to fix this error issue, the weight.grad equals to grad_weight.

ys950902 commented 3 months ago

@tjruwase could you please take a look on this pr, it will block the users to use the zero3 stage.

ys950902 commented 2 months ago

Any concern for this pr?

tjruwase commented 2 months ago

@ys950902, apologies for the delay on this PR. I am confused about this issue since Megatron-DeepSpeed and ZeRO3 have always been compatible. Can you share some repro details to help my understanding? Thanks!

ys950902 commented 1 month ago

@ys950902, apologies for the delay on this PR. I am confused about this issue since Megatron-DeepSpeed and ZeRO3 have always been compatible. Can you share some repro details to help my understanding? Thanks!

Hi @tjruwase, sorry for late response, for zero-bubble added in latest Megatron-DeepSpeed, the backward computation will be divided in two parts(weight gradient and output), and the weight gradient will not compute at once and will set to be None, when in running zero2/3 will divided the gradient may cause some unexpected error, this pr is to add the flag when doing pipeline-parallelism go the current path, when doing zero2/3 go the former path.

ys950902 commented 1 month ago

Hi @tjruwase, if you still have some concern for this pr, please let me know.

tjruwase commented 1 month ago

Hi @tjruwase, sorry for late response, for zero-bubble added in latest Megatron-DeepSpeed, the backward computation will be divided in two parts(weight gradient and output),

Thanks for the explanation. In that case, shouldn't this behavior depend on whether zero-bubble is enabled? In other words, check for args.enable_zbh1_pipeline. Can you please clarify?

ys950902 commented 1 month ago

Hi @tjruwase, sorry for late response, for zero-bubble added in latest Megatron-DeepSpeed, the backward computation will be divided in two parts(weight gradient and output),

Thanks for the explanation. In that case, shouldn't this behavior depend on whether zero-bubble is enabled? In other words, check for args.enable_zbh1_pipeline. Can you please clarify?

Thanks for your reply, yes, maybe check for args.enable_zbh1_pipeline is more reasonable, because for pipelien-parallelism, only supported for ZERO1/ZERO0 on Deepspeed, it won't divided the gradient, so is also okay set to NONE on weight gradient calculation. I will modify the check to args.enable_zbh1_pipeline to avoid the confuse.

ys950902 commented 1 month ago

Hi @tjruwase, I noticed the bug issue https://github.com/microsoft/Megatron-DeepSpeed/issues/442 in zero-bubble, so I did a little modified to better understanding for customers, and this pr is the solution for this bug, Only enable args.enable_zbh1_pipeline go this path calculate the weight gradient when schedule is pop, when is not enable go the former path to calculate the weight gradient. I think this is more clear. And if approved, could you please also merge this pr, then we can upgrade the Megatron-DeepSpeed to latest for next release.