mksit opened this issue 2 months ago
I have observed a recent change in LinearWithGradAccumulationAndAsyncCommunication that stores the gradient of the weights in WeightGradStore as part of the new Zero Bubble Pipeline Parallelism feature (#396):
https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/tensor_parallel/layers.py#L370
However, the stored gradients are only accessed in deepspeed_zbh1_engine:
https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/pipeline_parallel/deepspeed_zbh1_engine.py#L108
If the Zero Bubble Pipeline Parallelism feature is not enabled, it seems that the gradients are not being returned. Is this expected behavior?
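For context, here is a minimal sketch of the deferred weight-gradient pattern as I understand it. The class and method names (WeightGradStore.put / flush / pop) mirror the ones referenced above, but the signatures and internals here are simplified assumptions for illustration, not the repository's actual API:

```python
import queue
import torch

class WeightGradStore:
    """Queues the tensors needed to compute dW later, instead of computing it
    inside backward(). A zero-bubble pipeline engine is expected to drain the
    queue; if no engine ever calls pop(), the weight gradients are never
    materialized."""
    cache = []
    store = queue.Queue()

    @classmethod
    def put(cls, total_input, grad_output, weight):
        # Called from the linear layer's backward(): defer dW computation.
        cls.cache.append((total_input, grad_output, weight))

    @classmethod
    def flush(cls):
        # Move the deferred work for the current micro-batch into the queue.
        cls.store.put(cls.cache)
        cls.cache = []

    @classmethod
    def pop(cls):
        # Called by the zero-bubble engine (e.g. deepspeed_zbh1_engine) to
        # actually compute and accumulate the deferred weight gradients.
        for total_input, grad_output, weight in cls.store.get():
            grad_weight = grad_output.t().matmul(total_input)
            if weight.grad is None:
                weight.grad = grad_weight
            else:
                weight.grad += grad_weight

if __name__ == "__main__":
    weight = torch.zeros(4, 3, requires_grad=True)   # [out_features, in_features]
    total_input = torch.randn(2, 3)                  # [batch, in_features]
    grad_output = torch.randn(2, 4)                  # [batch, out_features]

    # backward() defers dW instead of computing and returning it:
    WeightGradStore.put(total_input, grad_output, weight)
    WeightGradStore.flush()

    # Without a zero-bubble engine calling pop(), weight.grad stays None,
    # which is the situation this issue asks about.
    WeightGradStore.pop()
    print(weight.grad.shape)                         # torch.Size([4, 3])
```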