Describe the bug
I reviewed how self.gradient_accumulation_steps is initialized in the DeepSpeedConfig module when only train_batch and micro_batch are set (DeepSpeed version: 0.13.1):
However, in PP+DP (Pipeline Parallel + Data Parallel) mode, not every rank is assigned its own batch for training. Should the formula above therefore use dp_degree instead of self.world_size? Correspondingly, the check for train_batch would become:
train_batch = grad_acc * micro_batch * dp_degree
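To make the discrepancy concrete, here is a minimal sketch of the arithmetic. It is not DeepSpeed's actual code: the helper grad_acc_steps, the variable pp_degree, and all the numbers are hypothetical, and it assumes dp_degree = world_size / pp_degree with no other form of model parallelism.

```python
def grad_acc_steps(train_batch: int, micro_batch: int, num_replicas: int) -> int:
    """Solve train_batch = grad_acc * micro_batch * num_replicas for grad_acc.

    num_replicas is whatever count the formula divides by: the full world
    size, or the data-parallel degree as proposed in this issue.
    """
    grad_acc, remainder = divmod(train_batch, micro_batch * num_replicas)
    if remainder != 0:
        raise ValueError(
            f"train_batch={train_batch} is not divisible by "
            f"micro_batch * num_replicas = {micro_batch * num_replicas}"
        )
    return grad_acc


# Hypothetical PP+DP layout: 8 ranks split into 4 pipeline stages,
# which leaves 2 data-parallel replicas (an assumption for illustration).
world_size = 8
pp_degree = 4
dp_degree = world_size // pp_degree

train_batch = 64
micro_batch = 4

# Dividing by world_size, as I understand the current initialization to do.
grad_acc_world = grad_acc_steps(train_batch, micro_batch, world_size)

# Dividing by dp_degree, as proposed above.
grad_acc_dp = grad_acc_steps(train_batch, micro_batch, dp_degree)

print(f"grad_acc using world_size: {grad_acc_world}")
print(f"grad_acc using dp_degree:  {grad_acc_dp}")
```

With these made-up numbers the two choices give 2 and 8 accumulation steps respectively, and only the second satisfies train_batch = grad_acc * micro_batch * dp_degree.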
The current initialization yields an unexpected grad_acc value during my PP+DP training. I may be misunderstanding something; please correct me if so. Thank you.