Closed Cospui closed 10 months ago
I had this same issue, I temporarily fixed it by neutering the different staging directory:
    if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
        logger.warning(
            f"Checkpoint destination directory {output_dir} already exists and is non-empty."
            "Saving will proceed but saved results may be invalid."
        )
        staging_output_dir = output_dir
    else:
        # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
        staging_output_dir = output_dir
Where did you insert this?
Facing the same issue in multi-node training:
    File "/home/user/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2353, in _save_checkpoint
        self.save_model(staging_output_dir, _internal_call=True)
    RuntimeError: Parent directory tmp-checkpoint-200 does not exist.
It added an annoying tmp- prefix to the checkpoint name.
This is a showstopper for training on multi-GPU nodes. The culprit seems to be the following merged PR #27820.
There is an open PR #27929, which seems to fix the issue. @ArthurZucker @sgugger @younesbelkada
Hi all, can you please do pip install git+https://github.com/huggingface/transformers
and rerun your code? This should fix your issue now.
Thank you very much for your patience and flagging this!
@muellerzr @thundergolfer I still get the same checkpoint-saving error with the latest transformers release, 4.36, and even with "4.37.0.dev0".
I used three workers, each with two GPUs. I tried saving the fine-tuned model to shared storage and to non-shared storage, and in both cases I still got the same error!
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'
    File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
        return inner_training_loop(
    File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
        self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
    File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
        self._save_checkpoint(model, trial, metrics=metrics)
    File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2395, in _save_checkpoint
        os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'
although model/checkpoint-49 has already been created!
@hahmad2008 can you try either pip install transformers -U or reinstalling from git? The line numbers in your traceback don't match a version that includes the fix.
I encountered this issue with the trainer with the following command-line. This was after recently updating transformers with pip install transformers --upgrade
--save_strategy epoch --save_total_limit 1
transformers==4.36.2
Edit: One thing to note: this was with 2 nodes with 8x A100s per node. Looking at the code around the error, I have a feeling this happened because local=True may have been used with main_process_first. Going to try disabling save_on_each_node.
    if staging_output_dir != output_dir:
        with self.args.main_process_first(
            desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
        ):
            if os.path.exists(staging_output_dir):
                os.rename(staging_output_dir, output_dir)
Edit 2: Looks like it's still not working even when setting save_on_each_node to false.
Here is the full command, launched from a slurm sbatch job:
srun --kill-on-bad-exit=1 --jobid $SLURM_JOB_ID bash -c "accelerate launch --use_deepspeed --zero_stage 1 --deepspeed_hostfile hostfile --deepspeed_multinode_launcher openmpi --gradient_accumulation_steps 1 --num_processes $(( $NUM_GPUS * $COUNT_NODE )) --num_machines $COUNT_NODE --num_cpu_threads_per_process $CPU_COUNT --mixed_precision bf16 --machine_rank \$SLURM_PROCID --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT main.py --source_datasets_filepath source_data/clm --output_dir testing_output_cluster --model_number 2 --overwrite_output_dir --dataloader_num_workers 10 --bf16 --data_fraction 0.1 --save_strategy steps --save_total_limit 1 --save_on_each_node false --dataloader_num_workers 2 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_token_length 1024 --num_train_epochs 1"
I encountered a similar error when using the trainer with DeepSpeed.
The error occurs when another process finishes the rename in the window between the os.path.exists(staging_output_dir) check and the os.rename call.
I had no other choice, so I resorted to using a try block to get around it.
    if staging_output_dir != output_dir:
        with self.args.main_process_first(
            desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
        ):
            if os.path.exists(staging_output_dir):
                try:
                    os.rename(staging_output_dir, output_dir)
                except Exception as e:
                    logger.info(f"Could not rename checkpoint directory from {staging_output_dir} to {output_dir}. Reason: {e}")
transformers-4.37.0.dev0
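The check-then-rename window described above can be reproduced without any distributed setup. The following is a minimal sketch, not the Trainer's code: `rename_if_present` is a hypothetical helper, and the second call stands in for a second process that arrives after the first has already moved the staging directory. It catches only FileNotFoundError rather than every exception, so unrelated failures still surface.

```python
import os
import tempfile


def rename_if_present(src: str, dst: str) -> bool:
    """Try to rename src to dst; return False if src is already gone.

    Hypothetical helper: skipping a separate os.path.exists check removes
    the window in which another process can rename src first.
    """
    try:
        os.rename(src, dst)
        return True
    except FileNotFoundError:
        # Another process already moved the staging directory.
        return False


with tempfile.TemporaryDirectory() as run_dir:
    staging = os.path.join(run_dir, "tmp-checkpoint-1")
    final = os.path.join(run_dir, "checkpoint-1")
    os.mkdir(staging)

    first = rename_if_present(staging, final)   # the process that wins the race
    second = rename_if_present(staging, final)  # a late process: src no longer exists
```

This only hides the symptom; every process still attempts the rename, which is why later comments in the thread suggest restricting it to one process.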
Hi, @snowyday, @tblattner, and @muellerzr. I think main_process_first may be broken.
I ran the trainer with 2 nodes x 8 V100 GPUs and DeepSpeed. When I turned on log_level=debug, I found that only one process entered the waiting mode, while all other processes tried to save the checkpoint.
The log from process that waited:
[DEBUG|training_args.py:2119] 2023-12-27 15:11:30,917 >> 4: waiting for the main process to perform Renaming model checkpoint folder to true location
I also encounter this with 4.36.2 and HEAD in a multi-node multi-GPU setup. Looks like an obvious race condition, as it happens nondeterministically (sometimes on the 2nd save, sometimes on the 7th save, etc.).
Hi Any update or final conclusion here? :>
any solutions? facing the same issue on multinode training using deepspeed
same here, any solutions?
I've been using a try-except approach for bypassing the issue, and it's been working well for me. However, as xk-huang mentioned, it seems that the root cause is that self.args.main_process_first is not handling multi-node setups properly.
Curious if there is any reason why we must do os.path.exists and os.rename for each process. Why not just the main process(es)?
Haven't tested this code yet as my compute resources are currently filled and I have a long-running experiment set to finish in a couple days, but wanted to get some thoughts on this potential solution.
    # Only rename from main process to avoid race condition from other processes especially for distributed filesystems
    if staging_output_dir != output_dir:
        if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:
            if os.path.exists(staging_output_dir):
                os.rename(staging_output_dir, output_dir)
    self.args.distributed_state.wait_for_everyone()
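The pattern proposed above (one process renames, everyone then waits) can be tried on a single machine. This is a sketch under stated assumptions: threads stand in for ranks, `threading.Barrier` stands in for `wait_for_everyone()`, and `rank == 0` stands in for `is_main_process`. It illustrates the ordering only and says nothing about how a network filesystem propagates the rename.

```python
import os
import tempfile
import threading


def worker(rank, barrier, staging, final, missing):
    # Everyone finishes "writing" before the rename starts.
    barrier.wait()
    if rank == 0:  # stand-in for distributed_state.is_main_process
        if os.path.exists(staging):
            os.rename(staging, final)
    # Stand-in for wait_for_everyone(): nobody continues until rank 0 is done.
    barrier.wait()
    if not os.path.isdir(final):
        missing.append(rank)


def simulate(world_size: int = 4) -> list:
    """Return the ranks that did not see the final directory (ideally none)."""
    missing = []
    with tempfile.TemporaryDirectory() as run_dir:
        staging = os.path.join(run_dir, "tmp-checkpoint-1")
        final = os.path.join(run_dir, "checkpoint-1")
        os.mkdir(staging)
        barrier = threading.Barrier(world_size)
        threads = [
            threading.Thread(target=worker, args=(r, barrier, staging, final, missing))
            for r in range(world_size)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return missing
```

Because only one "rank" ever calls `os.rename`, there is no second caller to hit FileNotFoundError, and the second barrier keeps the others from touching the directory early.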
I'm using the transformers Trainer. Is there any workaround for this?
As a workaround with Trainer, I just subclassed it and replaced the _save_checkpoint method with a version that wraps the rename in try/except.
    class CustomTrainer(Trainer):
        def _save_checkpoint(self, model, trial, metrics=None):
            # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
            # want to save except FullyShardedDDP.
            # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"
            # Save model checkpoint
            checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
            if self.hp_search_backend is None and trial is None:
                self.store_flos()
            run_dir = self._get_output_dir(trial=trial)
            output_dir = os.path.join(run_dir, checkpoint_folder)
            if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
                logger.warning(
                    f"Checkpoint destination directory {output_dir} already exists and is non-empty."
                    "Saving will proceed but saved results may be invalid."
                )
                staging_output_dir = output_dir
            else:
                staging_output_dir = os.path.join(
                    run_dir, f"tmp-{checkpoint_folder}")
            self.save_model(staging_output_dir, _internal_call=True)
            if not self.args.save_only_model:
                # Save optimizer and scheduler
                self._save_optimizer_and_scheduler(staging_output_dir)
                # Save RNG state
                self._save_rng_state(staging_output_dir)
            # Determine the new best metric / best model checkpoint
            if metrics is not None and self.args.metric_for_best_model is not None:
                metric_to_check = self.args.metric_for_best_model
                if not metric_to_check.startswith("eval_"):
                    metric_to_check = f"eval_{metric_to_check}"
                metric_value = metrics[metric_to_check]
                operator = np.greater if self.args.greater_is_better else np.less
                if (
                    self.state.best_metric is None
                    or self.state.best_model_checkpoint is None
                    or operator(metric_value, self.state.best_metric)
                ):
                    self.state.best_metric = metric_value
                    self.state.best_model_checkpoint = output_dir
            # Save the Trainer state
            if self.args.should_save:
                self.state.save_to_json(os.path.join(
                    staging_output_dir, TRAINER_STATE_NAME))
            if self.args.push_to_hub:
                self._push_from_checkpoint(staging_output_dir)
            # Place checkpoint in final location after all saving is finished.
            # First wait for everyone to finish writing
            self.args.distributed_state.wait_for_everyone()
            # Then go through the rewriting process starting on process 0
            try:
                if staging_output_dir != output_dir:
                    with self.args.main_process_first(
                        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
                    ):
                        if os.path.exists(staging_output_dir):
                            os.rename(staging_output_dir, output_dir)
                # Maybe delete some older checkpoints.
                if self.args.should_save:
                    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
            except Exception:
                print("Error rotating checkpoints skipping")
                pass
I've checked main_process_first using the code snippet below:
Number of nodes: 3
Processes per node (GPUs): 4
Total: 12 processes
    import logging
    import deepspeed
    import transformers
    import torch

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    if __name__ == "__main__":
        deepspeed.init_distributed()
        node_rank = torch.distributed.get_rank()
        training_args = transformers.TrainingArguments(per_device_train_batch_size=8,
                                                       gradient_accumulation_steps=2,
                                                       num_train_epochs=3,
                                                       deepspeed="ds_config/ds_config_zero3.json",
                                                       output_dir="logs")
        with training_args.main_process_first():
            logger.info(f"Check `main_process_first`. Node rank {node_rank}")
Address family not supported by protocol).
[INFO:root:Check `main_process_first`. Node rank 8
INFO:root:Check `main_process_first`. Node rank 0
INFO:root:Check `main_process_first`. Node rank 4
INFO:root:Check `main_process_first`. Node rank 6
INFO:root:Check `main_process_first`. Node rank 10
INFO:root:Check `main_process_first`. Node rank 5
INFO:root:Check `main_process_first`. Node rank 9
INFO:root:Check `main_process_first`. Node rank 1
INFO:root:Check `main_process_first`. Node rank 2
INFO:root:Check `main_process_first`. Node rank 3
INFO:root:Check `main_process_first`. Node rank 7
INFO:root:Check `main_process_first`. Node rank 11
The node rankings appear to be correctly allocated, with Node rank 0 going to node 1, Node rank 4 to node 2, and Node rank 8 to node 3; however, there are inaccuracies with the global rankings. In the context of a shared filesystem, if we proceed without waiting for the result from global rank 0, it could cause conflicts during the os.rename operation.
    if staging_output_dir != output_dir:
        with self.args.main_process_first(
            desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
        ):
            if os.path.exists(staging_output_dir):
                os.rename(staging_output_dir, output_dir)
however, there are inaccuracies with the global rankings.
@snowyday as indicated by the fact that rank 8 is printed first?
@thundergolfer Rank 0 should pop up first, and the others should hang tight until the renaming wraps up.
I should set args.save_on_each_node=False:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
Without having tested, this looks like the right direction.
In the end, simply setting save_on_each_node=False worked out for everything.
    training_args = transformers.TrainingArguments(..., save_on_each_node=False, ...)
By setting save_on_each_node=False in TrainingArguments, it ensures that in the Trainer's _save_checkpoint method, main_process_first's local will be set to False. Consequently, following the explanation provided, it works correctly.
if False first means process of rank 0 of node rank 0. In multi-node environment with a shared filesystem you most likely will want to use local=False so that only the main process of the first node will do the processing.
    @contextlib.contextmanager
    def main_process_first(self, local=True, desc="work"):
        """
        A context manager for torch distributed environment where on needs to do something on the main process, while
        blocking replicas, and when it's finished releasing the replicas.

        One such use is for `datasets`'s `map` feature which to be efficient should be run once on the main process,
        which upon completion saves a cached version of results and which then automatically gets loaded by the
        replicas.

        Args:
            local (`bool`, *optional*, defaults to `True`):
                if `True` first means process of rank 0 of each node if `False` first means process of rank 0 of node
                rank 0 In multi-node environment with a shared filesystem you most likely will want to use
                `local=False` so that only the main process of the first node will do the processing. If however, the
                filesystem is not shared, then the main process of each node will need to do the processing, which is
                the default behavior.
            desc (`str`, *optional*, defaults to `"work"`):
                a work description to be used in debug logs
        """
Would this work in a setting without a shared file system?
I've checked it on a GPU cluster with a shared file system. For multi-node setups with independent file systems, the default save_on_each_node=True is fine; main_process_first makes sure to serialize the execution for each node. If that still doesn't work, then I think there might still be an issue with main_process_first.
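To make the two modes concrete, here is a small sketch of which global ranks would count as "first" under `local=True` and `local=False`. It assumes ranks are assigned contiguously per node (node 0 holds ranks 0..k-1, and so on), which matches the 3 x 4 test above but is an assumption, not something the launcher guarantees. `first_ranks` is a hypothetical helper, not part of transformers.

```python
def first_ranks(num_nodes: int, procs_per_node: int, local: bool) -> list:
    """List the global ranks that run before the others.

    Assumes contiguous rank layout: global_rank = node * procs_per_node + local_rank.
    """
    firsts = []
    for global_rank in range(num_nodes * procs_per_node):
        local_rank = global_rank % procs_per_node
        # local=True: rank 0 of each node; local=False: rank 0 of node 0 only.
        is_first = (local_rank == 0) if local else (global_rank == 0)
        if is_first:
            firsts.append(global_rank)
    return firsts


# 3 nodes x 4 processes, as in the test above.
per_node = first_ranks(3, 4, local=True)
global_only = first_ranks(3, 4, local=False)
```

With `local=True` on a shared filesystem, ranks 0, 4, and 8 would all try to rename the same directory, which is consistent with those three ranks logging first in the output above.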
I don't think there is an issue with main_process_first as I've been using it across a lot of dataset processing steps.
I believe that on network/shared file systems os.rename is not atomic. So it's possible that the rename is not yet visible on the file system when os.rename returns, causing other processes to observe the wrong state. I haven't found a good way to ensure the rename is completed. Catching the exception would handle it, but that's not my ideal way to deal with the race condition.
In the case of processes sharing a filesystem, it seems prudent for only one process to wait for a rename operation to complete. However, why is main_process_first being used? On a shared filesystem, if the rename() fails, options are limited. Is this why multiple processes are making repeated attempts?
I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.
My suggestion is to use something like this:
    if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:
Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.
This would only use the main process if save_on_each_node is false, otherwise only the local main processes. Which I think is the intended behavior. The part I'm not sure of is if the renamed file is used later downstream, then that could introduce a race condition there...
It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.
It is, so we could have a race condition. An fsync could certainly be done, and your logic makes sense. @tblattner would you like to open a PR on this by chance?
FYI, we tested and also experienced this without a shared FS (accelerate/pdsh, simple two-node setup).
Also, if we rely on a full fsync implementation in the checkpoint folder, it might be good to explicitly call that out in the docs, as not all filesystems/mount options will fail hard on "fake" fsync calls.
I can get a start on a PR. Not sure what the best methodology for running fsync on a rename operation is, but I'll give it a shot.
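One common POSIX approach to "fsync after rename" is to open the directory that contains the renamed entry and call `os.fsync` on its file descriptor. The sketch below is an illustration under assumptions, not the eventual PR: `rename_and_sync` is a hypothetical name, it syncs the parent directory (where the renamed entry lives), and, as noted earlier in the thread, a network filesystem or mount option may accept the call without actually honoring it. The sync step is skipped on Windows, where opening a directory with `os.open` is not supported.

```python
import os
import tempfile


def rename_and_sync(src: str, dst: str) -> None:
    """Rename src to dst, then ask the OS to flush the parent directory."""
    os.rename(src, dst)
    if os.name != "nt":
        parent = os.path.dirname(os.path.abspath(dst))
        fd = os.open(parent, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            # Always release the descriptor, even if fsync raises.
            os.close(fd)


def demo() -> bool:
    """Run the helper on a throwaway directory and report whether it moved."""
    with tempfile.TemporaryDirectory() as run_dir:
        staging = os.path.join(run_dir, "tmp-checkpoint-1")
        final = os.path.join(run_dir, "checkpoint-1")
        os.mkdir(staging)
        rename_and_sync(staging, final)
        return os.path.isdir(final) and not os.path.exists(staging)
```

The `try`/`finally` around `os.fsync` matters on long training runs, where a leaked descriptor per checkpoint would add up.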
That's very nice of you to add "self.args.distributed_state.wait_for_everyone()". I found that after saving the model checkpoint, the following sometimes appears:
[Watchdog]() caught collective operation timeout: WorkNCCL(SeqNum=292968, OpType=_ALLGATHER_BASE, NumelIn=1882369, NumelOut=45176856
any updates?
This was fixed by the PR I believe !
A similar error has now occurred at L.2561 89c6481
I am experiencing this issue in a distributed training environment that utilizes a shared file system across 16 nodes, with each node equipped with 4 GPUs. I'm deploying the training using DeepSpeed's OpenMPI launcher.
In this setup, I have observed scenarios where the cleanup command shutil.rmtree(staging_output_dir) at L.2561 in the code fails to execute due to the condition self.is_local_process_zero() not being met on the slave nodes. This is intended to "Clean up the remaining staging checkpoint folders on other nodes," but it does not always work as expected.
    File "XXX/transformers/src/transformers/trainer.py", line 2561, in _save_checkpoint
        shutil.rmtree(staging_output_dir)
    File "XXX/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
        os.unlink(entry.name, dir_fd=topfd)
    FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'
    [the same os.unlink / FileNotFoundError lines repeat, interleaved, from multiple processes]
[89c6481]
    # Then go through the rewriting process, only renaming and rotating from main process(es)
    if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
        if staging_output_dir != output_dir:
            if os.path.exists(staging_output_dir):
                try:
                    os.rename(staging_output_dir, output_dir)
                except Exception as e:
                    logger.error(
                        f"Error occurred when attempting to rename checkpoint folder: {e}\n"
                        "The checkpoint folder will not be renamed, but the training will proceed."
                    )
                # Ensure rename completed in cases where os.rename is not atomic
                # And can only happen on non-windows based systems
                if os.name != "nt":
                    fd = os.open(output_dir, os.O_RDONLY)
                    os.fsync(fd)
                    os.close(fd)
        # Maybe delete some older checkpoints.
        if self.args.should_save:
            # Solely rely on numerical checkpoint id for rotation.
            # mtime is not reliable especially on some fuse fs in cloud environments.
            self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
    elif self.is_local_process_zero():
        # Clean up the remaining staging checkpoint folders on other nodes
        if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
            shutil.rmtree(staging_output_dir)  # L.2561
    self.args.distributed_state.wait_for_everyone()
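The comment in the snippet above about relying "solely on numerical checkpoint id" can be illustrated with a standalone sketch. This is not the Trainer's `_rotate_checkpoints` implementation (which, among other things, also has to consider the best model checkpoint); `checkpoints_to_delete` is a hypothetical helper that only shows the ordering idea: sort by the step number parsed from the folder name instead of by mtime.

```python
import re

# Matches final checkpoint folders only; "tmp-checkpoint-N" staging folders are ignored.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def checkpoints_to_delete(names, save_total_limit):
    """Return the oldest checkpoint folder names beyond save_total_limit.

    Ordering uses the integer step in the name, so "checkpoint-1000"
    sorts after "checkpoint-900" (a plain string sort would not).
    """
    if save_total_limit is None or save_total_limit <= 0:
        return []
    numbered = []
    for name in names:
        match = _CHECKPOINT_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()
    excess = max(len(numbered) - save_total_limit, 0)
    return [name for _, name in numbered[:excess]]
```

For example, with folders for steps 100, 200, 900, and 1000 plus a leftover `tmp-checkpoint-1100`, a limit of 2 would select the step-100 and step-200 folders for deletion and leave the staging folder alone.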
Although os.path.exists(staging_output_dir) is used for verification, it seems that staging_output_dir does not exist when shutil.rmtree(staging_output_dir) is executed. It looks like a try-except block needs to be implemented here as well.
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        try:
            shutil.rmtree(staging_output_dir)  # L.2561
        except Exception as e:
            logger.error(
                f"Error occurred when attempting to delete checkpoint folder: {e}\n"
            )
        if os.name != "nt":
            fd = os.open(staging_output_dir, os.O_RDONLY)
            os.fsync(fd)
            os.close(fd)
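A narrower variant of the cleanup idea, as a sketch only: `remove_staging_dir` is a hypothetical helper that treats "already gone" as success and lets other errors propagate. It leaves out the fsync step, because once the directory has been removed, `os.open` on that path would itself raise FileNotFoundError.

```python
import os
import shutil
import tempfile


def remove_staging_dir(path: str) -> bool:
    """Delete a staging directory; return False if it was already removed.

    Hypothetical helper: FileNotFoundError means another process finished
    the cleanup (or the rename) first, which is not an error for the caller.
    """
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        return False


def demo():
    with tempfile.TemporaryDirectory() as run_dir:
        staging = os.path.join(run_dir, "tmp-checkpoint-1")
        os.mkdir(staging)
        first = remove_staging_dir(staging)   # removes the directory
        second = remove_staging_dir(staging)  # directory is already gone
        return first, second
```

Note that this covers a missing top-level directory; files vanishing mid-walk on a shared filesystem, as in the traceback above, surface as the same exception type and are swallowed the same way.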
Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what's been addressed and what's a new issue
Hello @amyeroberts & @snowyday, I just wanted to share that I have encountered a very similar issue while using transformers 4.37.0 on Windows 10 (as admin) with a single GPU. The error I got reads as follows:
    \lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
        fd = os.open(output_dir, os.O_RDONLY)
    PermissionError: [Errno 13] Permission denied: '.
Hi @chercheurkg, have you tried on the latest release? There was a patch release for 4.37 which should have addressed this.
@amyeroberts , Thanks for your reply! As per your suggestion, on the same machine, I used transformer version 4.37. However, it did not work for me. I got the same error.
Ah, sorry, wasn't clear, I meant to use either 4.37.2 or 4.38.1
In my case, 4.38.2 also has this issue. After switching to 4.37.2 on all nodes, it was fixed.
@DreamInvoker Could you try running on main? pip install git+https://github.com/huggingface/transformers
I also hit the same problem in 4.38.2. Using 4.37.2 fixes this issue.
I have tried the latest version v4.40.0 with overwrite_output_dir=False. Everything works well.
I'm working on 4 nodes (32 GPUs) sharing the same filesystem.
When using v4.39.0, the error No such file or directory: 'model/tmp-checkpoint-100' -> 'model/checkpoint-100' occurs.
After switching to v4.37.2 I encountered a new problem.
My first setting is shown below.
    do_train=True,
    do_eval=False,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=5,
    overwrite_output_dir=True,
The model stopped saving checkpoints after step 900 although my global step was 1300.
Then I trained a new model with overwrite_output_dir=False and hit the same issue:
FileNotFoundError: [Errno 2] No such file or directory:
Did you try with transformers==4.39.1?
Hi, were you able to get rid of this error? Thanks
staging_output_dir = output_dir
staging_output_dir = output_dir
@ArthurZucker Can the AWS HuggingFace DL containers be updated as well? Currently Training Images are using Transformers 4.36.0 and impacted by this issue (i.e. All Training Jobs using Distributed Training with checkpoints are failing with this error, see log below).
Existing HuggingFace DL Container Images: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers
Transformer Version: PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04
HuggingFace Trainer on Sagemaker Logs:
ErrorMessage "FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'
100%|██████████| 2903/2903 [2:35:27<00:00, 3.21s/it]
[2024-08-30 09:19:24,433] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65 closing signal SIGTERM
[2024-08-30 09:19:24,997] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /opt/conda/bin/python
Traceback (most recent call last)
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
============================================================
train_fsdp.py FAILED
------------------------------------------------------------
Failures
[1]
time : 2024-08-30_09:19:24
host : algo-3
rank : 9 (local_rank: 1)
exitcode : 1 (pid: 64)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]
rank : 11 (local_rank: 3)
exitcode : 1 (pid: 66)
Root Cause (first observed failure)
[0]
rank : 8 (local_rank: 0)
exitcode : 1 (pid: 63)"
System Info
transformers version: 4.36.0.dev0
Who can help?
@muellerzr and @pacman100 I found that when launching the example trainer code on multiple nodes, the code raises a FileNotFound error when saving the checkpoint. After debugging, I think the cause is in trainer.py L2382: when one process renames the folder, the other processes encounter the FileNotFound error. Maybe one can modify the code like this to avoid the error:
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run the MAE training code from the example folder.
Expected behavior
Solve the FileNotFound error.