Save model checkpoint error when multi-gpu training

Cospui commented 10 months ago

System Info

transformers version: 4.36.0.dev0
Platform: Linux-6.2.0-1017-azure-x86_64-with-glibc2.35
Python version: 3.10.13
Huggingface_hub version: 0.19.4
Safetensors version: 0.4.0
Accelerate version: 0.24.1
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu118 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Yes

Who can help?

@muellerzr and @pacman100 When I launch the example trainer code on multiple nodes, it raises a FileNotFoundError while saving the checkpoint. After debugging, I think the cause is in trainer.py L2382:

        if staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)

When one process renames the folder, the other processes hit the FileNotFoundError because the staging folder no longer exists. Guarding the rename like this might avoid the error:

        if self.args.should_save and staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)
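The race can be reproduced without GPUs. Below is a minimal sketch that only uses the standard library and calls the rename step once per simulated rank, sequentially. The names `finalize_checkpoint` and `simulate` are hypothetical, the directory names are made up, and `rank == 0` is only an assumed stand-in for `self.args.should_save`; this is not the Trainer's actual code.

```python
import os
import tempfile


def finalize_checkpoint(staging_dir: str, output_dir: str, should_save: bool) -> bool:
    """Rename staging_dir to output_dir only on a process allowed to save.

    Returns True if this call performed the rename.
    """
    if should_save and staging_dir != output_dir:
        os.rename(staging_dir, output_dir)
        return True
    return False


def simulate(world_size: int, guarded: bool) -> list:
    """Run the finalize step once per simulated rank.

    Returns one entry per rank: "renamed", "skipped", or "FileNotFoundError".
    """
    results = []
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "tmp-checkpoint-10")
        final = os.path.join(root, "checkpoint-10")
        os.mkdir(staging)
        for rank in range(world_size):
            # Assumption: only rank 0 is allowed to save when guarded.
            # Unguarded, every rank attempts the rename, as in the report.
            should_save = (rank == 0) if guarded else True
            try:
                renamed = finalize_checkpoint(staging, final, should_save)
                results.append("renamed" if renamed else "skipped")
            except FileNotFoundError:
                results.append("FileNotFoundError")
    return results


if __name__ == "__main__":
    print("unguarded:", simulate(3, guarded=False))
    print("guarded:  ", simulate(3, guarded=True))
```

Unguarded, the first rank renames the folder and every later rank fails because the staging folder is gone. With the guard, the other ranks skip the rename. In a real distributed run the non-saving ranks would probably also need to wait at a barrier before touching `output_dir`, which this sketch does not model.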

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Run the MAE training code from the example folder.

Expected behavior

Checkpoints are saved without raising the FileNotFoundError.

ArthurZucker commented 1 month ago

cc @philschmid for the AWS container update!

solanki-ravi commented 1 month ago

FWIW, the following entries in my requirements.txt, which update transformers and accelerate, seem to work well:

transformers==4.44.2
accelerate==0.34.0

I am starting with this image:

Framework | Job Type | CPU/GPU | Python Version Options | Example URL
-- | -- | -- | -- | --
PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04
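
Since the image ships transformers 4.36.0, it may be worth confirming inside the container that the pinned versions were actually installed. A small sketch using only the standard library; `installed_version` and `check_pins` are hypothetical helper names, and the pins are the ones quoted above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: dict) -> dict:
    """Map each package to a tuple (expected, installed, matches)."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        report[package] = (expected, found, found == expected)
    return report


if __name__ == "__main__":
    # Pins taken from the requirements.txt entries in this comment.
    pins = {"transformers": "4.44.2", "accelerate": "0.34.0"}
    for name, (expected, found, ok) in check_pins(pins).items():
        print(f"{name}: expected {expected}, installed {found}, match={ok}")
```

Running this at the start of the training job gives an early signal if the container still resolves to the older bundled versions.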
