During distributed training with PyTorch, the number of training steps increases with the number of processes.
To reproduce:
- Transformers: 4.41.2
- Torch: 2.3.1
- Accelerate: 0.31.0
Distributing to 4 GPU devices trains for 500K steps.
```bash
torchrun --nproc_per_node=4 \
  -m tevatron.driver.train \
  --output_dir "$RUN_DIR" \
  --model_name_or_path "$MODEL_PATH" \
  --dataset_name "Tevatron/msmarco-passage" \
  --per_device_train_batch_size 32 \
  --num_train_epochs $NUM_EPOCHS \
  --dataloader_drop_last True \
```
Distributing to 2 GPU devices trains for 250K steps.
```bash
torchrun --nproc_per_node=2 \
  -m tevatron.driver.train \
  --output_dir "$RUN_DIR" \
  --model_name_or_path "$MODEL_PATH" \
  --dataset_name "Tevatron/msmarco-passage" \
  --per_device_train_batch_size 32 \
  --num_train_epochs $NUM_EPOCHS \
  --dataloader_drop_last True \
```
This happens because the dataloader is duplicated across the GPUs instead of being sharded: Hugging Face moved the sharding logic into Accelerate, and it is now the accelerator that prepares the dataloader for the distributed configuration.
https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/trainer.py#L904
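A minimal sketch of the two behaviours (the sizes and names below are illustrative, not taken from Tevatron or Transformers): a plain `DataLoader` built identically on every rank yields the full number of batches per process, while a sharded loader, which is what `accelerator.prepare(dataloader)` arranges under DDP, yields roughly `1/world_size` of that.

```python
# Illustrative sketch only; not the Tevatron/Transformers code path.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def batches_per_rank(dataset_size=1000, batch_size=32, world_size=4, rank=0):
    dataset = TensorDataset(torch.arange(dataset_size))

    # Duplicated: every rank iterates the whole dataset.
    duplicated = DataLoader(dataset, batch_size=batch_size, drop_last=True)

    # Sharded: each rank only sees its 1/world_size slice, as a
    # DistributedSampler (or accelerator.prepare) would give it.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                 shuffle=False, drop_last=True)
    sharded = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                         drop_last=True)

    return len(duplicated), len(sharded)

print(batches_per_rank())  # (31, 7): full dataset vs. the per-rank shard
```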
With proper sharding, the correct number of training steps here should be 125K.
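One quick way to confirm the duplication on a live run (a hedged diagnostic, not part of Tevatron; `trainer` is assumed to be the `Trainer` instance that `tevatron.driver.train` builds): print the per-rank dataloader length and compare it with the full-dataset count divided by the world size.

```python
# Hedged diagnostic sketch; drop it into the training script after the
# Trainer is constructed. Variable names here are assumptions.
import torch.distributed as dist

train_dataloader = trainer.get_train_dataloader()
world_size = dist.get_world_size() if dist.is_initialized() else 1
rank = dist.get_rank() if dist.is_initialized() else 0

# With correct sharding, each rank should report roughly
# len(dataset) / (per_device_batch_size * world_size) batches;
# if the loader is duplicated, every rank reports the full count.
print(f"rank={rank} world_size={world_size} batches_per_rank={len(train_dataloader)}")
```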