microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Training batch size is not consistent with train_batch_size #6657

Open tnnandi opened 1 month ago

tnnandi commented 1 month ago

Describe the bug
For multi-GPU training, the number of batches per epoch does not decrease by the same factor as the number of GPUs.

To Reproduce
For the configuration below, when using a dataset with 1 million samples and 4 GPUs, the number of batches (as obtained from the training dataloader length) is 62,500 (= 1M/16) instead of 250,000 (= 1M/4).

"train_batch_size": 4,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
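
For reference, a minimal standalone sketch of a reproduction along these lines (not the reporter's actual script): the synthetic TensorDataset, the toy Linear model, the file name, and the Adam optimizer block are assumptions added only so that deepspeed.initialize can build an engine and return a training dataloader.

```python
# repro_sketch.py -- hypothetical name; launch with: deepspeed --num_gpus 4 repro_sketch.py
import torch
import deepspeed

ds_config = {
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    # Optimizer block added only so the engine can be constructed; not part of the report.
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Synthetic stand-in for the 1M-sample dataset described in the report.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1_000_000, 8), torch.randint(0, 2, (1_000_000,))
)
model = torch.nn.Linear(8, 2)

engine, _, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

# Per the report: expected len(train_loader) == 250_000 (= 1M / 4 GPUs),
# but 62_500 (= 1M / 16) was observed instead.
print(f"rank {engine.global_rank}: len(train_loader) = {len(train_loader)}")
```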

Expected behavior
The number of batches in a multi-GPU setting should be (training data size) / num_gpus, but it is not.
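
As a sanity check of the arithmetic behind this expectation (assuming the documented DeepSpeed relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × num_gpus, and that each rank iterates over a 1/num_gpus shard of the data):

```python
dataset_size = 1_000_000
num_gpus = 4
micro_batch_per_gpu = 1   # train_micro_batch_size_per_gpu
grad_accum = 1            # gradient_accumulation_steps

# The config is internally consistent: 1 * 1 * 4 == 4 == train_batch_size.
train_batch_size = micro_batch_per_gpu * grad_accum * num_gpus

# Each rank sees 1/4 of the data in micro-batches of size 1.
expected_batches_per_rank = dataset_size // (num_gpus * micro_batch_per_gpu)  # 250_000
observed_batches_per_rank = 62_500                                            # reported (= 1M / 16)
```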

tjruwase commented 1 month ago

@tnnandi, can you please share the log from a run using a smaller number of samples, e.g. 64 samples? This will help us investigate. Thanks!