microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Gradient Accumulation Steps Initialization Bug in Pipeline Parallel Mode #5410

Open fwyc0573 opened 5 months ago

fwyc0573 commented 5 months ago

Describe the bug
I reviewed how self.gradient_accumulation_steps is initialized in the DeepSpeedConfig module for the case where only train_batch and micro_batch are set (DeepSpeed version 0.13.1):

grad_acc = train_batch // micro_batch        # micro-batches per global train batch
grad_acc //= self.world_size                 # divided across every rank in the job
self.gradient_accumulation_steps = grad_acc
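
For concreteness, a minimal sketch of what this formula computes under a PP+DP layout (the numbers and the pp_degree/dp_degree names below are illustrative, not from DeepSpeed):

# Illustrative: 8 GPUs arranged as 4 pipeline stages x 2 data-parallel replicas.
pp_degree, dp_degree = 4, 2
world_size = pp_degree * dp_degree       # 8 ranks in total

train_batch = 32                         # global train batch size
micro_batch = 2                          # micro-batch size per GPU

# Current DeepSpeedConfig logic: divide by the full world size.
grad_acc = train_batch // micro_batch    # 32 // 2 = 16
grad_acc //= world_size                  # 16 // 8 = 2
print(grad_acc)                          # 2, even though only dp_degree replicas consume batches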

However, in PP+DP (Pipeline Parallel + Data Parallel) mode, not every rank is assigned its own batch for training; only the data-parallel replicas are. Shouldn't the formula above therefore use dp_degree instead of self.world_size? Correspondingly, the consistency check for train_batch would become:

train_batch = grad_acc * micro_batch * dp_degree

The current initialization produces an unexpected grad_acc value in my PP+DP training runs. I may be misunderstanding something; please correct me if so. Thank you.
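
A minimal sketch of the suggested change, assuming a dp_degree value is available inside DeepSpeedConfig at this point (variable names and numbers are illustrative, continuing the example above):

train_batch, micro_batch, dp_degree = 32, 2, 2   # same illustrative numbers as above

# Proposed: derive gradient accumulation from the data-parallel degree only.
grad_acc = train_batch // micro_batch            # 32 // 2 = 16
grad_acc //= dp_degree                           # 16 // 2 = 8

# And the corresponding consistency check:
assert train_batch == grad_acc * micro_batch * dp_degree

With these numbers grad_acc goes from 2 to 8, i.e. the two formulas differ by exactly a factor of pp_degree.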

HackGiter commented 3 months ago

Can you provide more details? I think you're right; the computed grad_acc ends up off by a factor of pp_stages.