microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Gradient Accumulation Steps Initialization Bug in Pipeline Parallel Mode #5410

Open fwyc0573 opened 5 months ago

fwyc0573 commented 5 months ago

Describe the bug
I reviewed how self.gradient_accumulation_steps is initialized in the DeepSpeedConfig module for the case where only train_batch and micro_batch are set (DeepSpeed version 0.13.1):

grad_acc = train_batch // micro_batch        # micro-batches per global train batch
grad_acc //= self.world_size                 # divided across every rank in the job
self.gradient_accumulation_steps = grad_acc
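
For concreteness, a minimal sketch of what this formula computes under a PP+DP layout (the numbers and the pp_degree/dp_degree names below are illustrative, not from DeepSpeed):

# Illustrative: 8 GPUs arranged as 4 pipeline stages x 2 data-parallel replicas.
pp_degree, dp_degree = 4, 2
world_size = pp_degree * dp_degree       # 8 ranks in total

train_batch = 32                         # global train batch size
micro_batch = 2                          # micro-batch size per GPU

# Current DeepSpeedConfig logic: divide by the full world size.
grad_acc = train_batch // micro_batch    # 32 // 2 = 16
grad_acc //= world_size                  # 16 // 8 = 2
print(grad_acc)                          # 2, even though only dp_degree replicas consume batches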

However, in PP+DP (Pipeline Parallel + Data Parallel) mode, not every rank is assigned its own batch for training; only the data-parallel replicas are. Shouldn't the formula above therefore use dp_degree instead of self.world_size? Correspondingly, the consistency check for train_batch would become:

train_batch = grad_acc * micro_batch * dp_degree

The current initialization produces an unexpected grad_acc value in my PP+DP training runs. I may be misunderstanding something; please correct me if so. Thank you.
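
A minimal sketch of the suggested change, assuming a dp_degree value is available inside DeepSpeedConfig at this point (variable names and numbers are illustrative, continuing the example above):

train_batch, micro_batch, dp_degree = 32, 2, 2   # same illustrative numbers as above

# Proposed: derive gradient accumulation from the data-parallel degree only.
grad_acc = train_batch // micro_batch            # 32 // 2 = 16
grad_acc //= dp_degree                           # 16 // 2 = 8

# And the corresponding consistency check:
assert train_batch == grad_acc * micro_batch * dp_degree

With these numbers grad_acc goes from 2 to 8, i.e. the two formulas differ by exactly a factor of pp_degree.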

HackGiter commented 3 months ago

Can you provide more details? I think you're right; the computed grad_acc ends up off by a factor of pp_stages.