microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Training batch size is not consistent with train_batch_size #6657

Open tnnandi opened 1 month ago

tnnandi commented 1 month ago

Describe the bug
For multi-GPU training, the number of batches per epoch does not decrease by the same factor as the number of GPUs.

To Reproduce
For the configuration below, when using a dataset with 1 million samples and 4 GPUs, the number of batches (as obtained from the training dataloader length) is 62,500 (= 1M/16) instead of 250,000 (= 1M/4).

"train_batch_size": 4,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
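
For reference, a minimal standalone sketch of a reproduction along these lines (not the reporter's actual script): the synthetic TensorDataset, the toy Linear model, the file name, and the Adam optimizer block are assumptions added only so that deepspeed.initialize can build an engine and return a training dataloader.

```python
# repro_sketch.py -- hypothetical name; launch with: deepspeed --num_gpus 4 repro_sketch.py
import torch
import deepspeed

ds_config = {
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    # Optimizer block added only so the engine can be constructed; not part of the report.
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Synthetic stand-in for the 1M-sample dataset described in the report.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1_000_000, 8), torch.randint(0, 2, (1_000_000,))
)
model = torch.nn.Linear(8, 2)

engine, _, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

# Per the report: expected len(train_loader) == 250_000 (= 1M / 4 GPUs),
# but 62_500 (= 1M / 16) was observed instead.
print(f"rank {engine.global_rank}: len(train_loader) = {len(train_loader)}")
```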

Expected behavior
The number of batches in a multi-GPU setting should be (training data size) / num_gpus, but it is not.
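
As a sanity check of the arithmetic behind this expectation (assuming the documented DeepSpeed relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × num_gpus, and that each rank iterates over a 1/num_gpus shard of the data):

```python
dataset_size = 1_000_000
num_gpus = 4
micro_batch_per_gpu = 1   # train_micro_batch_size_per_gpu
grad_accum = 1            # gradient_accumulation_steps

# The config is internally consistent: 1 * 1 * 4 == 4 == train_batch_size.
train_batch_size = micro_batch_per_gpu * grad_accum * num_gpus

# Each rank sees 1/4 of the data in micro-batches of size 1.
expected_batches_per_rank = dataset_size // (num_gpus * micro_batch_per_gpu)  # 250_000
observed_batches_per_rank = 62_500                                            # reported (= 1M / 16)
```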

tjruwase commented 1 month ago

@tnnandi, can you please share the log from a run using a smaller number of samples, e.g. 64 samples? This will help us investigate. Thanks!