huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Multinode, multigpu example fails #3206

Open ffrancesco94 opened 3 weeks ago

ffrancesco94 commented 3 weeks ago

System Info

Accelerate 0.34.2
Numpy 1.26.4
(Singularity container based on Ubuntu 22.04)

Information

Tasks

Reproduction

I ran complete_nlp_example.py through the submit_multinode.sh submit script. At first the script ran without errors, but it parallelised over the CPUs rather than the GPUs. To fix that, I passed the --multi_gpu flag to accelerate launch, but then it crashed with an error about args.machine_rank being NoneType. To get around that, I added --machine_rank 0 to the launch command. With that it runs, but after training the first epoch it crashes with the following traceback:

[rank6]: Traceback (most recent call last):
[rank6]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank6]:     main()
[rank6]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank6]:     training_function(config, args)
[rank6]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 250, in training_function
[rank6]:     eval_metric = metric.compute()
[rank6]:                   ^^^^^^^^^^^^^^^^
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 456, in compute
[rank6]:     self._finalize()
[rank6]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 406, in _finalize
[rank6]:     raise ValueError(
[rank6]: ValueError: Error in finalize: another evaluation module instance is already using the local cache file. Please specify an experiment_id to avoid collision between distributed evaluation module instances.
[rank0]:[E1031 11:33:14.593046631 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 438, last enqueued NCCL work: 451, last completed NCCL work: 437.
W1031 11:33:16.784000 140252156977280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 563065 closing signal SIGTERM
W1031 11:33:16.784000 140252156977280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 563066 closing signal SIGTERM
W1031 11:33:16.784000 140252156977280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 563068 closing signal SIGTERM
E1031 11:33:17.203000 140252156977280 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 2 (pid: 563067) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-31_11:33:16
  host      : as02r3b28.bsc.mn
  rank      : 6 (local_rank: 2)
  exitcode  : 1 (pid: 563067)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: as02r3b28: task 1: Exited with exit code 1
srun: Terminating StepId=11053141.0
[rank1]:[E1031 11:33:17.117623236 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 440, last enqueued NCCL work: 451, last completed NCCL work: 439.
[rank2]:[E1031 11:33:17.121871912 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 440, last enqueued NCCL work: 451, last completed NCCL work: 439.
[rank3]:[E1031 11:33:17.232383987 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 440, last enqueued NCCL work: 451, last completed NCCL work: 439.
slurmstepd: error: *** STEP 11053141.0 ON as02r3b26 CANCELLED AT 2024-10-31T11:33:17 ***
W1031 11:33:17.950000 139643958321280 torch/distributed/elastic/agent/server/api.py:688] Received 15 death signal, shutting down workers
W1031 11:33:17.950000 139643958321280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 509707 closing signal SIGTERM
W1031 11:33:17.950000 139643958321280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 509708 closing signal SIGTERM
W1031 11:33:17.950000 139643958321280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 509709 closing signal SIGTERM
W1031 11:33:17.950000 139643958321280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 509710 closing signal SIGTERM
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 509683 got signal: 15
srun: error: as02r3b26: task 0: Exited with exit code 1

This looks like a communication problem; could it be that I need MPI in the container (even though it's launched with srun)? The script can correctly address the 8 GPUs across the two nodes. The cluster is MareNostrum 5 at BSC, if that matters (the submit scripts seem to fit that cluster, so I thought they might have been developed there).

Expected behavior

The script runs without errors.

muellerzr commented 3 weeks ago

When you individually run on the machine via docker or something, does import torch; print(torch.cuda.is_available()) print True?

ffrancesco94 commented 3 weeks ago

I can run the multi-GPU slurm script via the Singularity image and it correctly offloads to all 4 GPUs. Moreover, there are GPUs on the login node, and importing torch there and printing the devices finds them correctly.

muellerzr commented 3 weeks ago

This is likely because we have multiple rank 0 machines. Can you adapt the script to properly use the $MACHINE_RANK of the node?

ffrancesco94 commented 3 weeks ago

I tried adding the --machine_rank $SLURM_PROCID --role $SLURMD_NAME: flags, and if I use two GPUs across two nodes it dies with the following traceback:

load SINGULARITY/3.11.5 (PATH)
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py:494: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py:494: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'mrpc' at /gpfs/projects/ehpc63/hf/datasets/glue/mrpc/0.0.0/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Wed Oct 30 13:33:14 2024).
Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'mrpc' at /gpfs/projects/ehpc63/hf/datasets/glue/mrpc/0.0.0/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Wed Oct 30 13:33:14 2024).
Using the latest cached version of the module from /gpfs/projects/ehpc63/hf/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Wed Oct 30 11:19:29 2024) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /gpfs/projects/ehpc63/hf/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Wed Oct 30 11:19:29 2024) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank0]:     main()
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank0]:     training_function(config, args)
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 250, in training_function
[rank0]:     eval_metric = metric.compute()
[rank0]:                   ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 481, in compute
[rank0]:     os.remove(file_path)
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-1-0.arrow'
[rank0]:[W1031 15:43:27.809708975 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]:[E1031 15:43:27.765128949 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1737, last enqueued NCCL work: 1737, last completed NCCL work: 1736.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank1]:     main()
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank1]:     training_function(config, args)
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 213, in training_function
[rank1]:     for step, batch in enumerate(active_dataloader):
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 543, in __iter__
[rank1]:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/utils/random.py", line 132, in synchronize_rng_states
[rank1]:     synchronize_rng_state(RNGType(rng_type), generator=generator)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/utils/random.py", line 127, in synchronize_rng_state
[rank1]:     generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state
E1031 15:43:28.165000 140052551848064 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3813380) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-31_15:43:28
  host      : as07r5b12.bsc.mn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3813380)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[rank1]:[E1031 15:43:28.253652198 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1031 15:43:28.253667742 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1031 15:43:28.253706214 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer as07r5b12-ib0<55568>
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fb0037c81e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fb0037c842c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fb0037cf313 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb0037d171c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer as07r5b12-ib0<55568>
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fb0037c81e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fb0037c842c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fb0037cf313 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb0037d171c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb00345aa84 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)

E1031 15:43:28.381000 140589415776384 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 1081813) of binary: /usr/bin/python
W1031 15:43:28.397000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W1031 15:43:28.401000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W1031 15:43:28.409000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
complete_nlp_example.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-31_15:43:28
  host      : as07r5b15.bsc.mn
  rank      : 1 (local_rank: 0)
  exitcode  : -6 (pid: 1081813)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1081813
========================================================
srun: error: as07r5b12: task 0: Exited with exit code 1
srun: Terminating StepId=11066094.0

What's interesting is that parts of the output are duplicated. I'm starting to think that maybe I need some MPI setup in the container...

muellerzr commented 3 weeks ago

Might be https://github.com/huggingface/accelerate/issues/934#issuecomment-1451319158?

ffrancesco94 commented 3 weeks ago

Not sure. From what I can see, the root of the error is that it tries to delete this default_experiment-1-0.arrow file but doesn't find it. I do see that the corresponding .lock file is created. Not sure whether what comes later follows from that or not...

ffrancesco94 commented 3 weeks ago

Update: this seems related to https://github.com/huggingface/evaluate/issues/382, where two scripts try to compute the same metric at the same time. I'm not sure whether I can skip caching, but it's quite interesting that this only happens when I enforce GPUs and not when the script was falling back to CPUs...
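
If it is the same collision, one thing I might try is passing a unique experiment_id when loading the metric, as the error message itself suggests. A minimal, untested sketch (deriving the id from the SLURM job id is just an assumption):

```python
import os

import evaluate

# Untested sketch: give each run its own experiment_id (here derived from the SLURM
# job id) so concurrent evaluation module instances don't collide on the default
# cache file.
experiment_id = os.environ.get("SLURM_JOB_ID", "local-run")
metric = evaluate.load("glue", "mrpc", experiment_id=experiment_id)
```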

ffrancesco94 commented 2 weeks ago

Further update: complete_nlp_example.py is broken in the sense that the evaluate.load() call should include process_id and num_process so that the module knows it is running in a distributed environment (see the sketch after the traceback below). Once that is fixed, the processes die because of this bug, https://github.com/huggingface/evaluate/issues/542 (https://github.com/huggingface/evaluate/issues/481 is related). I have tried different combinations of datasets, evaluate and filelock versions without any success... Are there any fixes for those? Once this gets solved I could submit a PR with the right slurm scripts and calls in the complete_nlp_example script. Traceback attached here:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[rank1]:[W1102 21:00:57.360673584 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1102 21:00:57.749390097 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank1]:     main()
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank1]:     training_function(config, args)
[rank1]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 245, in training_function
[rank1]:     metric.add_batch(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 510, in add_batch
[rank1]:     self._init_writer()
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 659, in _init_writer
[rank1]:     self._check_rendez_vous()  # wait for master to be ready and to let everyone go
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 362, in _check_rendez_vous
[rank1]:     raise ValueError(
[rank1]: ValueError: Expected to find locked file /gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-2-0.arrow.lock from process 1 but it doesn't exist.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank0]:     main()
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank0]:     training_function(config, args)
[rank0]:   File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 245, in training_function
[rank0]:     metric.add_batch(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 510, in add_batch
[rank0]:     self._init_writer()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 656, in _init_writer
[rank0]:     self._check_all_processes_locks()  # wait for everyone to be ready
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 350, in _check_all_processes_locks
[rank0]:     raise ValueError(
[rank0]: ValueError: Expected to find locked file /gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-2-1.arrow.lock from process 0 but it doesn't exist.
E1102 21:03:02.757000 2414310 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2414397) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-02_21:03:02
  host      : as01r1b03.bsc.mn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2414397)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
E1102 21:03:02.785000 922486 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 922572) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-02_21:03:02
  host      : as01r1b09.bsc.mn
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 922572)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: as01r1b03: task 0: Exited with exit code 1
srun: Terminating StepId=11137338.0
srun: error: as01r1b09: task 1: Exited with exit code 1
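
For reference, the distributed-aware metric loading described at the top of this comment looks roughly like this. It is only a sketch (the experiment_id value is arbitrary), and on this filesystem it still ends in the lock-file race shown above:

```python
import evaluate
from accelerate import Accelerator

accelerator = Accelerator()

# Tell evaluate how many processes write to the metric cache and which one this
# process is; a shared experiment_id makes all ranks rendezvous on the same set of
# cache/lock files.
metric = evaluate.load(
    "glue",
    "mrpc",
    num_process=accelerator.num_processes,
    process_id=accelerator.process_index,
    experiment_id="complete_nlp_example",
)
```
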
ffrancesco94 commented 2 weeks ago

I fixed it by gathering the samples onto the main process and computing the metric only there, roughly as sketched below. Can I submit a PR with a working slurm script and the updated example?
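
The workaround looks roughly like this. It is only a sketch with assumed names (accelerator, model, eval_dataloader, metric), not the exact code I would put in the PR:

```python
import torch

# Sketch of the workaround: gather predictions and labels from all ranks, then feed
# and compute the evaluate metric on the main process only, so a single process
# touches the metric cache files.
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    predictions, references = accelerator.gather_for_metrics(
        (predictions, batch["labels"])
    )
    if accelerator.is_main_process:
        metric.add_batch(predictions=predictions, references=references)

if accelerator.is_main_process:
    eval_metric = metric.compute()
    accelerator.print(eval_metric)
```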