Closed Cospui closed 10 months ago
cc @philschmid for the AWS container update!
FWIW, the following entries in my requirements.txt, which update transformers and accelerate, seem to work well:
transformers==4.44.2
accelerate==0.34.0
I am starting with this image:
Framework | Job Type | CPU/GPU | Python Version Options | Example URL
-- | -- | -- | -- | --
PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04
System Info
`transformers` version: 4.36.0.dev0
Who can help?
@muellerzr and @pacman100
When I launch the example trainer code on multiple nodes, a FileNotFound error is raised while saving the checkpoint. After debugging, I think the cause is at L2382 of `trainer.py`: when one process renames the folder, the other processes encounter the FileNotFound error. Maybe the code could be modified like this to avoid the error:
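A minimal sketch of one way to make the rename tolerant of this race, not necessarily the change proposed here and not the actual `Trainer` code. The function name `finalize_checkpoint` and the directory names are hypothetical. It assumes every process may attempt the rename on a shared filesystem, so a process that loses the race treats an already-present final directory as success:

```python
import os
import tempfile


def finalize_checkpoint(staging_dir: str, final_dir: str) -> bool:
    """Move staging_dir to final_dir, tolerating a concurrent rename.

    Returns True if this call performed the rename, and False if the
    directory was already in place (for example, moved by another process).
    """
    if staging_dir == final_dir:
        return False
    try:
        os.rename(staging_dir, final_dir)
    except FileNotFoundError:
        # The staging folder is gone. If the final folder exists, another
        # process won the race and there is nothing left to do.
        if os.path.isdir(final_dir):
            return False
        # Neither folder exists: this is a real error, so re-raise it.
        raise
    return True


if __name__ == "__main__":
    # Simulate two processes finalizing the same checkpoint in turn.
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "tmp-checkpoint-10")  # illustrative name
        final = os.path.join(root, "checkpoint-10")        # illustrative name
        os.makedirs(staging)
        print(finalize_checkpoint(staging, final))  # first caller renames
        print(finalize_checkpoint(staging, final))  # second caller is a no-op
```

Catching the exception avoids the gap between an `os.path.exists` check and the rename itself, during which another process could still move the folder. In a real distributed run, the other ranks would also need to wait (for example at a barrier) before reading from the final directory.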
Information
Tasks
`examples` folder (such as GLUE/SQuAD, ...)
Reproduction
Run the MAE training code from the examples folder on multiple nodes.
Expected behavior
Checkpoint saving should complete without the FileNotFound error.