huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DDP error with load_best_model_at_end enabled #30702

Open zhiyuanhhh opened 3 months ago

zhiyuanhhh commented 3 months ago

System Info

Who can help?

@muellerzr and @pacman100

Information

Tasks

Reproduction

  1. Launch the training script with DDP: torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS train.py --config /opt/ml/input/config/hyperparameters.json
  2. In the trainer arguments, set load_best_model_at_end to true.
  3. At the end of the script, every GPU except rank 0 emits the following error:
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 5 has 128 params, while rank 0 has inconsistent 1506656875 params.
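
For anyone trying to reproduce this, step 2 can be sketched as a small script that writes the hyperparameters file. The layout of hyperparameters.json is not shown in this issue, so the keys below are an assumption that train.py forwards them to TrainingArguments. The helper name build_hyperparameters, the output path, and the step values are hypothetical. The only firm constraint illustrated is that load_best_model_at_end needs matching evaluation and save strategies, with save_steps a multiple of eval_steps.

```python
import json


def build_hyperparameters(output_dir: str) -> dict:
    """Return a hypothetical config that enables load_best_model_at_end.

    Assumption: train.py passes these keys straight to TrainingArguments.
    """
    return {
        "output_dir": output_dir,
        "load_best_model_at_end": True,
        # Named "evaluation_strategy" in older transformers releases.
        "eval_strategy": "steps",
        # Must match the evaluation strategy when loading the best model.
        "save_strategy": "steps",
        "eval_steps": 500,   # illustrative value
        "save_steps": 500,   # must be a multiple of eval_steps
        "metric_for_best_model": "loss",
    }


if __name__ == "__main__":
    # Print the JSON that would be written to the --config path.
    print(json.dumps(build_hyperparameters("/tmp/output"), indent=2))
```

With a config like this, the best checkpoint is reloaded once training ends, which is the stage where the error above is reported.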

Expected behavior

No error occurs.
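
One way to check that every rank really holds the same model is to compare parameter counts across ranks before anything is wrapped in DDP. This is only a diagnostic sketch, not a fix: the helper names are hypothetical, and it assumes the process group has already been initialized (as it is under torchrun with the Trainer).

```python
from typing import List


def count_params(model) -> int:
    """Total number of parameter elements held by this rank's model."""
    return sum(p.numel() for p in model.parameters())


def mismatched_ranks(counts: List[int]) -> List[int]:
    """Ranks whose parameter count differs from rank 0's count."""
    return [rank for rank, c in enumerate(counts) if c != counts[0]]


def gather_param_counts(model) -> List[int]:
    """Collect every rank's parameter count on all ranks.

    torch is imported lazily so the helpers above work without it.
    With the NCCL backend, each rank must already have its CUDA device set.
    """
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: only the local count exists.
        return [count_params(model)]
    counts = [None] * dist.get_world_size()
    dist.all_gather_object(counts, count_params(model))
    return counts
```

Calling mismatched_ranks(gather_param_counts(model)) on each rank should return an empty list when all ranks agree. A non-empty list names the ranks that differ from rank 0.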

amyeroberts commented 2 months ago

cc @muellerzr @SunMarc