huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Save model checkpoint error when multi-gpu training #27925

Closed Cospui closed 10 months ago

Cospui commented 10 months ago

Who can help?

@muellerzr and @pacman100 When I launch the example trainer code on multiple nodes, it raises a FileNotFoundError while saving the checkpoint. After debugging, I think the cause is in trainer.py L2382:

        if staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)

Once one process has renamed the folder, the other processes hit the FileNotFoundError because the staging directory no longer exists. Maybe the code could be modified like this to avoid the error:

        if self.args.should_save and staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)
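To illustrate why the guard helps, here is a minimal, standalone sketch that mimics the rename step without using the Trainer at all. The names `finalize_checkpoint` and `simulate` are hypothetical, the ranks run sequentially in one process rather than as real distributed workers, and it assumes all ranks see the same filesystem and that only rank 0 has `should_save` set.

```python
import os
import tempfile


def finalize_checkpoint(staging_dir: str, output_dir: str, should_save: bool) -> bool:
    """Rename staging_dir to output_dir only when this rank is allowed to save.

    Hypothetical helper mirroring the proposed guard; returns True if the
    rename was performed by this call.
    """
    if should_save and staging_dir != output_dir:
        os.rename(staging_dir, output_dir)
        return True
    return False


def simulate(guarded: bool, world_size: int = 4) -> list:
    """Call the rename once per simulated rank and record what happened."""
    outcomes = []
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "tmp-checkpoint-10")
        final = os.path.join(root, "checkpoint-10")
        os.mkdir(staging)
        for rank in range(world_size):
            # Assumption: with the guard, only rank 0 is allowed to save.
            # Without it, every rank attempts the rename.
            should_save = (rank == 0) if guarded else True
            try:
                finalize_checkpoint(staging, final, should_save)
                outcomes.append("ok")
            except FileNotFoundError:
                # The staging directory was already renamed by an earlier rank.
                outcomes.append("FileNotFoundError")
    return outcomes


if __name__ == "__main__":
    print("unguarded:", simulate(guarded=False))
    print("guarded:  ", simulate(guarded=True))
```

In the unguarded run, the first rank succeeds and every later rank fails because the source directory is gone; with the guard, no rank raises. In a real job the other ranks would probably also need to wait for the rename to finish before reading the checkpoint, which this sketch does not model.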

Reproduction

Run the MAE training code from the example folder.

Expected behavior

Checkpoints are saved during multi-GPU training without raising the FileNotFoundError.

ArthurZucker commented 1 month ago

cc @philschmid for the AWS container update!

solanki-ravi commented 1 month ago

FWIW, pinning transformers and accelerate to the following versions in my requirements.txt seems to work well:

        transformers==4.44.2
        accelerate==0.34.0

I am starting with this image:

Framework | Job Type | CPU/GPU | Python Version Options | Example URL
-- | -- | -- | -- | --
PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04