texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Incorrect training steps for distributed setting #137

Open theyorubayesian opened 1 week ago

theyorubayesian commented 1 week ago

During distributed training with PyTorch, the total number of training steps increases with the number of processes.

To reproduce:

Environment: Transformers 4.41.2, Torch 2.3.1, Accelerate 0.31.0

  1. Distributing across 4 GPU devices trains for 500K steps:

    torchrun --nproc_per_node=4 \
    -m tevatron.driver.train \
    --output_dir "$RUN_DIR" \
    --model_name_or_path "$MODEL_PATH" \
    --dataset_name "Tevatron/msmarco-passage" \
    --per_device_train_batch_size 32 \
    --num_train_epochs $NUM_EPOCHS \
    --dataloader_drop_last True
  2. Distributing across 2 GPU devices trains for 250K steps:

    torchrun --nproc_per_node=2 \
    -m tevatron.driver.train \
    --output_dir "$RUN_DIR" \
    --model_name_or_path "$MODEL_PATH" \
    --dataset_name "Tevatron/msmarco-passage" \
    --per_device_train_batch_size 32 \
    --num_train_epochs $NUM_EPOCHS \
    --dataloader_drop_last True

This happens because the training dataloader is duplicated across the GPUs instead of being sharded. Hugging Face moved the sharding logic into Accelerate: the Trainer now relies on the accelerator to prepare the dataloader for the current training configuration (see the linked Trainer code and the sketch below).

https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/trainer.py#L904
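
For reference, here is a minimal sketch of what a Trainer-style dataloader construction is expected to do: hand the raw DataLoader to accelerator.prepare so that each process only iterates its own shard. This is not Tevatron's or Transformers' actual code; the function and argument names are placeholders.

    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    accelerator = Accelerator()

    def get_train_dataloader(train_dataset, data_collator, per_device_batch_size=32):
        # A plain DataLoader is per-process: without further preparation,
        # every rank iterates the full dataset.
        dataloader = DataLoader(
            train_dataset,
            batch_size=per_device_batch_size,
            collate_fn=data_collator,
            drop_last=True,
        )
        # accelerator.prepare wraps the dataloader so each of the N processes
        # only sees its own 1/N share of the batches; skipping this call leaves
        # every rank with the full, unsharded dataloader.
        return accelerator.prepare(dataloader)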

With a properly sharded dataloader, the correct number of training steps here should be 125K.
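
For comparison, this is the step count one would expect from a correctly sharded setup, in generic form (NUM_EPOCHS and the exact size of the train split are whatever was used in the runs above; the function name is just for illustration):

    def expected_total_steps(num_examples, per_device_batch_size, world_size, num_epochs):
        # With a sharded dataloader and dataloader_drop_last=True, each optimizer
        # step consumes per_device_batch_size * world_size examples.
        steps_per_epoch = num_examples // (per_device_batch_size * world_size)
        return steps_per_epoch * num_epochs

Presumably this is the computation behind the 125K figure for the 4-GPU run above.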