During distributed training with PyTorch, the number of training steps increases with the number of processes.
To reproduce:
- Transformers: 4.41.2
- Torch: 2.3.1
- Accelerate: 0.31.0
Distributing to 4 GPU devices trains for 500K steps.
```bash
torchrun --nproc_per_node=4 \
  -m tevatron.driver.train \
  --output_dir "$RUN_DIR" \
  --model_name_or_path "$MODEL_PATH" \
  --dataset_name "Tevatron/msmarco-passage" \
  --per_device_train_batch_size 32 \
  --num_train_epochs $NUM_EPOCHS \
  --dataloader_drop_last True \
```
Distributing to 2 GPU devices trains for 250K steps.
```bash
torchrun --nproc_per_node=2 \
  -m tevatron.driver.train \
  --output_dir "$RUN_DIR" \
  --model_name_or_path "$MODEL_PATH" \
  --dataset_name "Tevatron/msmarco-passage" \
  --per_device_train_batch_size 32 \
  --num_train_epochs $NUM_EPOCHS \
  --dataloader_drop_last True \
```
This happens because the dataloader is duplicated across the GPUs instead of being sharded: Hugging Face moved the sharding logic into Accelerate, and it is now the accelerator that prepares the dataloader for the distributed configuration.
https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/trainer.py#L904
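A minimal sketch of the two behaviours (the sizes and names below are illustrative, not taken from Tevatron or Transformers): a plain `DataLoader` built identically on every rank yields the full number of batches per process, while a sharded loader, which is what `accelerator.prepare(dataloader)` arranges under DDP, yields roughly `1/world_size` of that.

```python
# Illustrative sketch only; not the Tevatron/Transformers code path.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def batches_per_rank(dataset_size=1000, batch_size=32, world_size=4, rank=0):
    dataset = TensorDataset(torch.arange(dataset_size))

    # Duplicated: every rank iterates the whole dataset.
    duplicated = DataLoader(dataset, batch_size=batch_size, drop_last=True)

    # Sharded: each rank only sees its 1/world_size slice, as a
    # DistributedSampler (or accelerator.prepare) would give it.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                 shuffle=False, drop_last=True)
    sharded = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                         drop_last=True)

    return len(duplicated), len(sharded)

print(batches_per_rank())  # (31, 7): full dataset vs. the per-rank shard
```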
With proper sharding, the correct number of training steps here should be 125K.
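One quick way to confirm the duplication on a live run (a hedged diagnostic, not part of Tevatron; `trainer` is assumed to be the `Trainer` instance that `tevatron.driver.train` builds): print the per-rank dataloader length and compare it with the full-dataset count divided by the world size.

```python
# Hedged diagnostic sketch; drop it into the training script after the
# Trainer is constructed. Variable names here are assumptions.
import torch.distributed as dist

train_dataloader = trainer.get_train_dataloader()
world_size = dist.get_world_size() if dist.is_initialized() else 1
rank = dist.get_rank() if dist.is_initialized() else 0

# With correct sharding, each rank should report roughly
# len(dataset) / (per_device_batch_size * world_size) batches;
# if the loader is duplicated, every rank reports the full count.
print(f"rank={rank} world_size={world_size} batches_per_rank={len(train_dataloader)}")
```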