microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[Bug] Missing weight gradients from LinearWithGradAccumulationAndAsyncCommunication when Zero Bubble Pipeline Parallelism is disabled #442

Open mksit opened 2 months ago

mksit commented 2 months ago

I have observed a recent change in LinearWithGradAccumulationAndAsyncCommunication that stores weight gradients in WeightGradStore as part of the new Zero Bubble Pipeline Parallelism feature (#396):

https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/tensor_parallel/layers.py#L370
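For context, my understanding of the deferred weight-gradient pattern is roughly the following. This is only a simplified sketch, not the actual repository code; the `WeightGradStore` API here is approximated:

```python
# Simplified sketch of the zero-bubble deferral pattern (approximated API, not the real code).
import queue

import torch


class WeightGradStore:
    """Queues deferred weight-gradient computations instead of running them in backward()."""
    cache = []
    weight_grad_queue = queue.Queue()

    @classmethod
    def put(cls, total_input, grad_output, weight, compute_func):
        # Remember everything needed to compute dW later, instead of computing it now.
        cls.cache.append((total_input, grad_output, weight, compute_func))

    @classmethod
    def flush(cls):
        # Seal the closures queued during the last backward pass.
        cls.weight_grad_queue.put(cls.cache)
        cls.cache = []

    @classmethod
    def pop(cls):
        # Execute the deferred dW computations; something in the engine must call this.
        for total_input, grad_output, weight, compute_func in cls.weight_grad_queue.get():
            compute_func(total_input, grad_output, weight)


def deferred_linear_weight_grad(total_input, grad_output, weight):
    # dW = grad_output^T @ input, accumulated into the fused main_grad buffer.
    weight.main_grad.add_(grad_output.t().matmul(total_input))
```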

However, the stored gradients are only accessed in deepspeed_zbh1_engine:

https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/pipeline_parallel/deepspeed_zbh1_engine.py#L108
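As far as I can tell, the consumer side looks roughly like the sketch below (reusing the `WeightGradStore` sketch above): the zero-bubble schedule splits the input-gradient backward (B) from the weight-gradient backward (W), and only the ZBH1 engine later drains the store to run the W passes.

```python
# Rough sketch of the consumer side in a ZBH1-style schedule (illustrative only).
def run_deferred_weight_grads():
    WeightGradStore.flush()  # seal the closures queued by the preceding backward pass
    WeightGradStore.pop()    # execute them, filling each weight.main_grad
```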

If the Zero Bubble Pipeline Parallelism feature is not enabled, it seems that the stored gradients are never consumed, so the weight gradients are not returned. Is this expected behavior?
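What I would have expected is something like the guard sketched below, where the deferral only happens when the zero-bubble schedule is active. The flag name (`args.enable_zbh1_pipeline`) is just my assumption for illustration, and the helpers come from the sketch above:

```python
# Hypothetical guard showing the behavior I expected; the flag name is an assumption.
def linear_backward_weight_grad(total_input, grad_output, weight, args):
    if getattr(args, "enable_zbh1_pipeline", False):
        # Zero-bubble path: defer dW; the ZBH1 engine pops the store later.
        WeightGradStore.put(total_input, grad_output, weight,
                            deferred_linear_weight_grad)
        return None
    # Default path: compute dW immediately so it is not silently dropped.
    return grad_output.t().matmul(total_input)
```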