huggingface / transformers

šŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Save model checkpoint error when multi-gpu training #27925

Closed Cospui closed 10 months ago

Cospui commented 10 months ago

System Info

Who can help?

@muellerzr and @pacman100 I found that when launching the example trainer code on multiple nodes, the code raises a FileNotFoundError when saving the checkpoint. After debugging, I think the reason is in trainer.py L2382:

        if staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)

When one process renames the folder, the other processes encounter the FileNotFoundError. Maybe the code can be modified like this to avoid the error:

        if self.args.should_save and staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)

Information

Tasks

Reproduction

Run the MAE training code from the example folder.

Expected behavior

Saving the checkpoint should not raise the FileNotFoundError.

jquesnelle commented 10 months ago

I had this same issue; I temporarily fixed it by neutering the separate staging directory:

if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir
staticpunch commented 10 months ago

I had this same issue; I temporarily fixed it by neutering the separate staging directory:

if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir

Where did you insert this?

Andcircle commented 10 months ago

Facing the same issue in multi-node training:

File "/home/user/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2353, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
RuntimeError: Parent directory tmp-checkpoint-200 does not exist.

It added an annoying tmp- prefix in front of the checkpoint.

peter-sk commented 10 months ago

This is a showstopper for training on multi-GPU nodes. The culprit seems to be the merged PR #27820.

peter-sk commented 10 months ago

There is an open PR #27929, which seems to fix the issue. @ArthurZucker @sgugger @younesbelkada

muellerzr commented 10 months ago

Hi all, can you please do pip install git+https://github.com/huggingface/transformers and rerun your code? This should fix your issue now.

Thank you very much for your patience and flagging this!

hahmad2008 commented 10 months ago

@muellerzr @thundergolfer I still get the same checkpoint-saving issue with the latest version of transformers (4.36) and even with '4.37.0.dev0'.

I used three workers, each with two GPUs. I tried fine-tuning with checkpoints saved to both shared and non-shared storage, and in both cases I still got the same error!

FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2395, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

although model/checkpoint-49 is already created!

muellerzr commented 10 months ago

@hahmad2008 can you try doing either pip install transformers -U or reinstalling from git? From the line numbers, it doesn't add up that you're using a version that includes the fix.

tblattner commented 10 months ago

I encountered this issue with the trainer using the following command-line arguments, after recently updating transformers with pip install transformers --upgrade:

--save_strategy epoch --save_total_limit 1

transformers==4.36.2

Edit: One thing to note: this was with 2 nodes, 8x A100s per node. Looking at the code around the error, I have a feeling this was because I may have used local=True with main_process_first. Going to try disabling save_on_each_node.

        if staging_output_dir != output_dir:
            with self.args.main_process_first(
                desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
            ):
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

Edit 2: Looks like it's still not working even when setting save_on_each_node to false.

Here is the full command, launched from a slurm sbatch job:

srun --kill-on-bad-exit=1 --jobid $SLURM_JOB_ID bash -c "accelerate launch --use_deepspeed --zero_stage 1 --deepspeed_hostfile hostfile --deepspeed_multinode_launcher openmpi --gradient_accumulation_steps 1 --num_processes $(( $NUM_GPUS * $COUNT_NODE )) --num_machines $COUNT_NODE --num_cpu_threads_per_process $CPU_COUNT --mixed_precision bf16 --machine_rank \$SLURM_PROCID --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT main.py --source_datasets_filepath source_data/clm --output_dir testing_output_cluster --model_number 2 --overwrite_output_dir --dataloader_num_workers 10 --bf16 --data_fraction 0.1 --save_strategy steps --save_total_limit 1 --save_on_each_node false --dataloader_num_workers 2 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_token_length 1024 --num_train_epochs 1"
snowyday commented 9 months ago

I encountered a similar error when using the Trainer with DeepSpeed. The error occurs at the exact moment when another process finishes renaming right after if os.path.exists(staging_output_dir): is evaluated.

I had no other choice, so I resorted to using a try block to get around it.

if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.info(f"Could not rename checkpoint directory from {staging_output_dir} to {output_dir}. Reason: {e}")

transformers-4.37.0.dev0

xk-huang commented 9 months ago

Hi, @snowyday , @tblattner , and @muellerzr . I think main_process_first may be broken.

I ran the trainer with 2 nodes x 8 V100 GPUs and DeepSpeed. When I turned on log_level=debug, I found that only one process entered waiting mode, while all the other processes tried to save the checkpoint.

The log from the process that waited:

[DEBUG|training_args.py:2119] 2023-12-27 15:11:30,917 >> 4: waiting for the main process to perform Renaming model checkpoint folder to true location
peter-sk commented 9 months ago

I also encounter this with 4.36.2 and HEAD in a multi-node, multi-GPU setup. It looks like an obvious race condition, as it happens nondeterministically (sometimes on the 2nd save, sometimes on the 7th, etc.).

lzy37ld commented 9 months ago

Hi, any update or final conclusion here? :>

roynirmal commented 9 months ago

Any solutions? Facing the same issue in multi-node training using DeepSpeed.

luvwinnie commented 9 months ago

same here, any solutions?

snowyday commented 9 months ago

I've been using a try-except approach to bypass the issue, and it's been working well for me. However, as xk-huang mentioned, the root cause seems to be that self.args.main_process_first is not handling multi-node setups properly.

tblattner commented 9 months ago

Curious if there is any reason why we must do os.path.exists and os.rename in each process; why not just the main process(es)?

I haven't tested this code yet, as my compute resources are currently full and I have a long-running experiment set to finish in a couple of days, but I wanted to get some thoughts on this potential solution.

        # Only rename from main process to avoid race condition from other processes especially for distributed filesystems
        if staging_output_dir != output_dir:
            if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

            self.args.distributed_state.wait_for_everyone()
luvwinnie commented 9 months ago

I'm using transformers' Trainer; is there any workaround for this?

luvwinnie commented 9 months ago

As a workaround with Trainer, I just subclassed it and replaced the _save_checkpoint method with a version that adds a try/except.

class CustomTrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
        # want to save except FullyShardedDDP.
        # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

        # Save model checkpoint
        checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

        if self.hp_search_backend is None and trial is None:
            self.store_flos()

        run_dir = self._get_output_dir(trial=trial)
        output_dir = os.path.join(run_dir, checkpoint_folder)
        if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
            logger.warning(
                f"Checkpoint destination directory {output_dir} already exists and is non-empty."
                "Saving will proceed but saved results may be invalid."
            )
            staging_output_dir = output_dir
        else:
            staging_output_dir = os.path.join(
                run_dir, f"tmp-{checkpoint_folder}")
        self.save_model(staging_output_dir, _internal_call=True)

        if not self.args.save_only_model:
            # Save optimizer and scheduler
            self._save_optimizer_and_scheduler(staging_output_dir)
            # Save RNG state
            self._save_rng_state(staging_output_dir)

        # Determine the new best metric / best model checkpoint
        if metrics is not None and self.args.metric_for_best_model is not None:
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            metric_value = metrics[metric_to_check]

            operator = np.greater if self.args.greater_is_better else np.less
            if (
                self.state.best_metric is None
                or self.state.best_model_checkpoint is None
                or operator(metric_value, self.state.best_metric)
            ):
                self.state.best_metric = metric_value
                self.state.best_model_checkpoint = output_dir

        # Save the Trainer state
        if self.args.should_save:
            self.state.save_to_json(os.path.join(
                staging_output_dir, TRAINER_STATE_NAME))

        if self.args.push_to_hub:
            self._push_from_checkpoint(staging_output_dir)

        # Place checkpoint in final location after all saving is finished.
        # First wait for everyone to finish writing
        self.args.distributed_state.wait_for_everyone()
        # Then go through the rewriting process starting on process 0
        try:
            if staging_output_dir != output_dir:
                with self.args.main_process_first(
                    desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
                ):
                    if os.path.exists(staging_output_dir):
                        os.rename(staging_output_dir, output_dir)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
        except Exception:
            print("Error rotating checkpoints skipping")
            pass
snowyday commented 9 months ago

I've checked main_process_first using the code snippet below. Number of nodes: 3; processes per node (GPUs): 4; total: 12 processes.

import logging

import deepspeed
import transformers
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    deepspeed.init_distributed()
    node_rank = torch.distributed.get_rank()   
    training_args = transformers.TrainingArguments(per_device_train_batch_size=8,
                                                   gradient_accumulation_steps=2,
                                                   num_train_epochs=3,
                                                   deepspeed="ds_config/ds_config_zero3.json",
                                                   output_dir="logs")

    with training_args.main_process_first():
        logger.info(f"Check `main_process_first`. Node rank {node_rank}")
INFO:root:Check `main_process_first`. Node rank 8
INFO:root:Check `main_process_first`. Node rank 0
INFO:root:Check `main_process_first`. Node rank 4
INFO:root:Check `main_process_first`. Node rank 6
INFO:root:Check `main_process_first`. Node rank 10
INFO:root:Check `main_process_first`. Node rank 5
INFO:root:Check `main_process_first`. Node rank 9
INFO:root:Check `main_process_first`. Node rank 1
INFO:root:Check `main_process_first`. Node rank 2
INFO:root:Check `main_process_first`. Node rank 3
INFO:root:Check `main_process_first`. Node rank 7
INFO:root:Check `main_process_first`. Node rank 11

The node rankings appear to be correctly allocated, with Node rank 0 going to node 1, Node rank 4 to node 2, and Node rank 8 to node 3; however, there are inaccuracies with the global rankings. In the context of a shared filesystem, if we proceed without waiting for the result from global rank 0, it could cause conflicts during the os.rename operation.

if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
thundergolfer commented 9 months ago

however, there are inaccuracies with the global rankings.

@snowyday as indicated by the fact that rank 8 is printed first?

snowyday commented 9 months ago

@thundergolfer Rank 0 should pop up first, and the others should hang tight until the renaming wraps up. I should set args.save_on_each_node=False:

with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
peter-sk commented 9 months ago

Without having tested, this looks like the right direction.

snowyday commented 9 months ago

In the end, simply setting save_on_each_node=False worked out for everything.

training_args = transformers.TrainingArguments(..., save_on_each_node=False, ...)

Setting save_on_each_node=False in TrainingArguments ensures that, in the Trainer's _save_checkpoint method, main_process_first's local will be set to False. Consequently, following the explanation in the docstring below, it works correctly.

As the docstring puts it: if False, first means the process of rank 0 of node rank 0. In a multi-node environment with a shared filesystem you most likely will want to use local=False so that only the main process of the first node will do the processing.

    @contextlib.contextmanager
    def main_process_first(self, local=True, desc="work"):
        """
        A context manager for torch distributed environment where on needs to do something on the main process, while
        blocking replicas, and when it's finished releasing the replicas.

        One such use is for `datasets`'s `map` feature which to be efficient should be run once on the main process,
        which upon completion saves a cached version of results and which then automatically gets loaded by the
        replicas.

        Args:
            local (`bool`, *optional*, defaults to `True`):
                if `True` first means process of rank 0 of each node if `False` first means process of rank 0 of node
                rank 0 In multi-node environment with a shared filesystem you most likely will want to use
                `local=False` so that only the main process of the first node will do the processing. If however, the
                filesystem is not shared, then the main process of each node will need to do the processing, which is
                the default behavior.
            desc (`str`, *optional*, defaults to `"work"`):
                a work description to be used in debug logs

        """
peter-sk commented 9 months ago

Would this work in a setting without a shared file system?

snowyday commented 9 months ago

I've checked it on a GPU cluster with a shared file system. For multi-node setups with independent file systems, the default save_on_each_node=True is fine; main_process_first makes sure to serialize the execution on each node. If that still doesn't work, then I think there might still be an issue with main_process_first.

tblattner commented 9 months ago

I don't think there is an issue with main_process_first as I've been using it across a lot of dataset processing steps.

I believe that on network/shared file systems os.rename is not atomic. So it's possible that the rename is not yet reflected by the filesystem when os.rename returns, causing other processes to observe the wrong state. I haven't found a good way to ensure the rename has completed. Catching the exception would handle it, though that's not my ideal way to deal with the race condition.
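
Just to illustrate what I mean by tolerating the exception, a rough sketch (the helper name and retry policy below are mine, not anything in transformers):

import os
import time

def rename_with_retry(src: str, dst: str, attempts: int = 5, delay: float = 1.0) -> None:
    # Hypothetical helper: tolerate a rename that is slow to become visible on a network filesystem.
    for _ in range(attempts):
        try:
            os.rename(src, dst)
            return
        except FileNotFoundError:
            # Either another process already renamed the staging folder,
            # or the filesystem has not yet made the source visible to this process.
            if os.path.isdir(dst) and not os.path.isdir(src):
                return  # someone else finished the rename; nothing left to do
            time.sleep(delay)
    raise RuntimeError(f"Could not rename {src} to {dst} after {attempts} attempts")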

snowyday commented 9 months ago

For processes sharing a filesystem, it seems prudent for only one process to perform the rename while the others wait for it to complete. However, why is main_process_first being used here? On a shared filesystem, if the rename() fails, options are limited. Is this why multiple processes are making repeated attempts?

tblattner commented 9 months ago

I'm not sure if it fails or not. From what I understand, the network-attached storage might not actually complete the operation before the next process comes to check whether the path exists. It will complete, just not always within the timeframe allowed. But that outlines the core issue here.

My suggestion is to use something like this: if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:

Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.

This would use only the main process if save_on_each_node is false, and otherwise only the local main processes, which I think is the intended behavior. The part I'm not sure of is whether the renamed folder is used later downstream; if so, that could introduce a race condition there...

It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.

muellerzr commented 9 months ago

It is, so we could have a race condition. An fsync could certainly be done, and your logic makes sense. @tblattner would you like to open a PR on this by chance?

mjbommar commented 9 months ago

FYI, we tested and also experienced this without a shared FS (accelerate/pdsh, simple two-node setup).

Also, if we rely on a full fsync implementation for the checkpoint folder, it might be good to explicitly call that out in the docs, as not all filesystems/mount options will fail hard on "fake" fsync calls.

tblattner commented 9 months ago

It is, so we could have a race condition. An fsync could be done certainly and your logic makes sense. @tblattner would you like to open a PR on this by chance?

I can get a start on a PR. I'm not sure what the best methodology is for running fsync on a rename operation, but I'll give it a shot.
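
One possibility (just a sketch of the idea under POSIX assumptions, not the eventual PR): fsync the parent directory after the rename so the new directory entry is flushed before other ranks look for it.

import os

def rename_and_sync(src: str, dst: str) -> None:
    # Rename the checkpoint folder, then fsync the parent directory so the new
    # directory entry is durable/visible before other processes check for it.
    os.rename(src, dst)
    if os.name != "nt":  # directory file descriptors cannot be fsynced on Windows
        parent_fd = os.open(os.path.dirname(dst) or ".", os.O_RDONLY)
        try:
            os.fsync(parent_fd)
        finally:
            os.close(parent_fd)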

yuleiqin commented 9 months ago

I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.

My suggestion is to use something like this: if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:

Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.

This would only use the main process if save_on_each_node is false, otherwise only the local main processes. Which I think is the intended behavior. The part I'm not sure of is if the renamed file is used later downstream, then that could introduce a race condition there...

It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.

That's very nice of you to add "self.args.distributed_state.wait_for_everyone()". I also found that after saving the model checkpoint, it is sometimes possible to see: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=292968, OpType=_ALLGATHER_BASE, NumelIn=1882369, NumelOut=45176856.

MaxGonzalezSaez-Diez commented 8 months ago

any updates?

ArthurZucker commented 8 months ago

This was fixed by the PR, I believe!

snowyday commented 7 months ago

A similar error has now occurred at L2561 of 89c6481.

I am experiencing this issue in a distributed training environment that utilizes a shared file system across 16 nodes, with each node equipped with 4 GPUs. I'm deploying the training using DeepSpeed's OpenMPI launcher.

In this setup, I have observed scenarios where the cleanup command shutil.rmtree(staging_output_dir) at L.2561 in the code fails to execute due to the condition self.is_local_process_zero() not being met on the slave nodes. This is intended to "Clean up the remaining staging checkpoint folders on other nodes," but it does not always work as expected.

File "XXX/transformers/src/transformers/trainer.py", line 2561, in _save_checkpoint
    shutil.rmtree(staging_output_dir)

File "XXX/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)

FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)

FileNotFoundError: FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: FileNotFoundError: FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'

FileNotFoundError: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'[Errno 2] No such file or directory: 'rng_state_6.pth'

The relevant code at 89c6481:

        # Then go through the rewriting process, only renaming and rotating from main process(es)
        if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            if staging_output_dir != output_dir:
                if os.path.exists(staging_output_dir):
                    try:
                        os.rename(staging_output_dir, output_dir)
                    except Exception as e:
                        logger.error(
                            f"Error occurred when attempting to rename checkpoint folder: {e}\n"
                            "The checkpoint folder will not be renamed, but the training will proceed."
                        )

                    # Ensure rename completed in cases where os.rename is not atomic
                    # And can only happen on non-windows based systems
                    if os.name != "nt":
                        fd = os.open(output_dir, os.O_RDONLY)
                        os.fsync(fd)
                        os.close(fd)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                # Solely rely on numerical checkpoint id for rotation.
                # mtime is not reliable especially on some fuse fs in cloud environments.
                self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)  # L2561

        self.args.distributed_state.wait_for_everyone()

Although os.path.exists(staging_output_dir) is used for verification, it seems that staging_output_dir does not exist when shutil.rmtree(staging_output_dir) is executed. It looks like a try-except block needs to be implemented here as well.

            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                try:
                    shutil.rmtree(staging_output_dir)  # L2561
                except Exception as e:
                    logger.error(
                        f"Error occurred when attempting to delete checkpoint folder: {e}\n"
                    )

                if os.name != "nt":
                    fd = os.open(staging_output_dir, os.O_RDONLY)
                    os.fsync(fd)
                    os.close(fd)
amyeroberts commented 7 months ago

Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what's been addressed and what's a new issue

chercheurkg commented 7 months ago

Hello @amyeroberts & @snowyday , I just wanted to share that I have encountered almost similar issue while using transformer 4.37.0 on Windows 10 (as admin) with single GPU. The error I got read as follows:

\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: '.

amyeroberts commented 7 months ago

Hi @chercheurkg, have you tried on the latest release? There was a patch release for 4.37 which should have addressed this.

chercheurkg commented 7 months ago

@amyeroberts, thanks for your reply! As per your suggestion, on the same machine, I used transformers version 4.37. However, it did not work for me; I got the same error.

amyeroberts commented 7 months ago

Ah, sorry, I wasn't clear; I meant to use either 4.37.2 or 4.38.1.

DreamInvoker commented 7 months ago

In my case, 4.38.2 also has this issue. When I switched to 4.37.2 on all nodes, it got fixed.

amyeroberts commented 7 months ago

@DreamInvoker Could you try running on main? pip install git+https://github.com/huggingface/transformers

yuzhms commented 7 months ago

I also met the same problem in 4.38.2. Using 4.37.2 fixes the issue.

tic-top commented 7 months ago

I have tried the latest version, v4.40.0, with overwrite_output_dir=False, and everything works well.

I'm working on 4 nodes (32 GPUs) sharing the same filesystem. When using v4.39.0, No such file or directory: 'model/tmp-checkpoint-100' -> 'model/checkpoint-100' occurs. After switching to v4.37.2 I encounter a new problem.

My first setting is shown below.

        do_train=True,
        do_eval=False,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=5,
        overwrite_output_dir=True,

The model stops saving checkpoints after 900, although my global step is 1300. (screenshot omitted)

Then I trained a new model with overwrite_output_dir=False: (screenshot omitted)

zhenyuhe00 commented 7 months ago

same issue

FileNotFoundError: [Errno 2] No such file or directory:

ArthurZucker commented 6 months ago

Did you try with transformers==4.39.1?

ruian1 commented 2 months ago

I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here. My suggestion is to use something like this: if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process: Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards. This would only use the main process if save_on_each_node is false, otherwise only the local main processes. Which I think is the intended behavior. The part I'm not sure of is if the renamed file is used later downstream, then that could introduce a race condition there... It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.

That's very nice of you to add "self.args.distributed_state.wait_for_everyone()" and I found that after saving the model checkpoint, it is sometimes probable to see: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=292968, OpType=_ALLGATHER_BASE, NumelIn=1882369, NumelOut=45176856.

Hi, were you able to get rid of this error? Thanks

azuryl commented 1 month ago
staging_output_dir = output_dir

solanki-ravi commented 1 month ago

@ArthurZucker Can the AWS HuggingFace DL containers be updated as well? The current training images use Transformers 4.36.0 and are impacted by this issue (i.e., all training jobs using distributed training with checkpoints fail with this error; see the log below).

Existing HuggingFace DL Container Images: https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers

Transformer Version: PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04

HuggingFace Trainer on Sagemaker Logs:

ErrorMessage "FileNotFoundErrorFileNotFoundErrorFileNotFoundError: : : [Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'[Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'[Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'
 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 2903/2903 [2:35:27<00:00,  3.21s/it]
 [2024-08-30 09:19:24,433] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65 closing signal SIGTERM
 [2024-08-30 09:19:24,997] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /opt/conda/bin/python
 Traceback (most recent call last)
 File "/opt/conda/bin/torchrun", line 33, in <module>
 sys.exit(load_entry_point('torch==2.1.0', 'console_scripts', 'torchrun')())
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
 return f(*args, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
 run(args)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
 elastic_launch(
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
 return launch_agent(self._config, self._entrypoint, list(args))
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
 raise ChildFailedError(
 torch.distributed.elastic.multiprocessing.errors.ChildFailedError
 ============================================================
 train_fsdp.py FAILED
 ------------------------------------------------------------
 Failures
 [1]
 time      : 2024-08-30_09:19:24
 host      : algo-3
 rank      : 9 (local_rank: 1)
 exitcode  : 1 (pid: 64)
 error_file: <N/A>
 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 [2]
 rank      : 11 (local_rank: 3)
 exitcode  : 1 (pid: 66)
 Root Cause (first observed failure)
 [0]
 rank      : 8 (local_rank: 0)
 exitcode  : 1 (pid: 63)"