DDP error with load_best_model_at_end enabled

System Info

transformers version: 4.40.1
Platform: Linux-5.10.214-202.855.amzn2.x86_64-x86_64-with-glibc2.35
Python version: 3.10.14
Huggingface_hub version: 0.23.0
Safetensors version: 0.4.3
Accelerate version: 0.29.3
Accelerate config: not found
PyTorch version (GPU?): 2.3.0 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: ddp

Who can help?

@muellerzr and @pacman100

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

1. Launch the training script with DDP: torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS train.py --config /opt/ml/input/config/hyperparameters.json
2. In the Trainer arguments, set load_best_model_at_end to true.
3. At the end of the script, every GPU except rank 0 emits the following error:

RuntimeError: DDP expects same model across all ranks, but Rank 5 has 128 params, while rank 0 has inconsistent 1506656875 params.
    return dist._verify_params_across_processes(process_group, tensors, logger)
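
For reference, here is a minimal sketch of Trainer arguments that enable load_best_model_at_end as in the reproduction above. The actual contents of hyperparameters.json are not included in this report, so every value below (output directory, step counts, metric) is a hypothetical placeholder, and the helper name make_training_args is my own. The keyword evaluation_strategy is the spelling used by transformers 4.40.x.

```python
# Hypothetical values: the real hyperparameters.json is not shown in this report.
TRAINING_KWARGS = {
    "output_dir": "outputs",            # placeholder path
    "evaluation_strategy": "steps",     # must match save_strategy when loading the best model
    "eval_steps": 500,                  # placeholder
    "save_strategy": "steps",
    "save_steps": 500,                  # must be a round multiple of eval_steps
    "load_best_model_at_end": True,     # the setting that triggers the error described here
    "metric_for_best_model": "loss",
    "greater_is_better": False,         # lower loss is better
}


def make_training_args():
    """Build TrainingArguments from the kwargs above (needs transformers installed)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

Inside train.py, the result of make_training_args() would be passed to Trainer(args=...), and the script would be launched with the torchrun command from the reproduction.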

Expected behavior

No error should occur.

huggingface / transformers

DDP error with load_best_model_at_end enabled #30702