ffrancesco94 opened this issue 3 weeks ago
When you individually run on the machine via docker or something, does `import torch; print(torch.cuda.is_available())` print True?
I can run the multi-GPU SLURM script via the Singularity image and it correctly offloads to all 4 GPUs. Moreover, there are GPUs on the login node, and importing torch there and printing the devices finds them correctly.
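(For reference, a minimal version of the check being discussed; the `device_count()` line is an extra sanity check beyond the original question.)

```python
import torch

# Quick check that CUDA is visible to this process (run inside the container / on the node)
print(torch.cuda.is_available())  # should print True if the GPUs are reachable
print(torch.cuda.device_count())  # extra sanity check: number of GPUs visible to this process
```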
This is likely because we have multiple rank 0 machines. Can you adapt the script to properly use the `$MACHINE_RANK` of the node?
I tried adding the `--machine_rank $SLURM_PROCID --role $SLURMD_NAME:` flags, and if I use two GPUs on two nodes it dies with the following traceback:
```
load SINGULARITY/3.11.5 (PATH)
The following values were not passed to `accelerate launch` and had defaults used instead:
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The following values were not passed to `accelerate launch` and had defaults used instead:
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py:494: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/usr/local/lib/python3.12/dist-packages/accelerate/accelerator.py:494: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
/usr/local/lib/python3.12/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'mrpc' at /gpfs/projects/ehpc63/hf/datasets/glue/mrpc/0.0.0/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Wed Oct 30 13:33:14 2024).
Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'mrpc' at /gpfs/projects/ehpc63/hf/datasets/glue/mrpc/0.0.0/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Wed Oct 30 13:33:14 2024).
Using the latest cached version of the module from /gpfs/projects/ehpc63/hf/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Wed Oct 30 11:19:29 2024) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /gpfs/projects/ehpc63/hf/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Wed Oct 30 11:19:29 2024) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank0]: Traceback (most recent call last):
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank0]: main()
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank0]: training_function(config, args)
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 250, in training_function
[rank0]: eval_metric = metric.compute()
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 481, in compute
[rank0]: os.remove(file_path)
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-1-0.arrow'
[rank0]:[W1031 15:43:27.809708975 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]:[E1031 15:43:27.765128949 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1737, last enqueued NCCL work: 1737, last completed NCCL work: 1736.
[rank1]: Traceback (most recent call last):
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank1]: main()
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank1]: training_function(config, args)
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 213, in training_function
[rank1]: for step, batch in enumerate(active_dataloader):
[rank1]: File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 543, in __iter__
[rank1]: synchronize_rng_states(self.rng_types, self.synchronized_generator)
[rank1]: File "/usr/local/lib/python3.12/dist-packages/accelerate/utils/random.py", line 132, in synchronize_rng_states
[rank1]: synchronize_rng_state(RNGType(rng_type), generator=generator)
[rank1]: File "/usr/local/lib/python3.12/dist-packages/accelerate/utils/random.py", line 127, in synchronize_rng_state
[rank1]: generator.set_state(rng_state)
[rank1]: RuntimeError: Invalid mt19937 state
E1031 15:43:28.165000 140052551848064 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3813380) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-31_15:43:28
host : as07r5b12.bsc.mn
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3813380)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[rank1]:[E1031 15:43:28.253652198 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1031 15:43:28.253667742 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1031 15:43:28.253706214 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer as07r5b12-ib0<55568>
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fb0037c81e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fb0037c842c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fb0037cf313 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb0037d171c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer as07r5b12-ib0<55568>
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1892 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fb0037c81e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fb0037c842c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7fb0037cf313 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb0037d171c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb051577f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb00345aa84 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xecdb4 (0x7fb05d944db4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9ca94 (0x7fb1daba9a94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c3c (0x7fb1dac36c3c in /lib/x86_64-linux-gnu/libc.so.6)
E1031 15:43:28.381000 140589415776384 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 1081813) of binary: /usr/bin/python
W1031 15:43:28.397000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W1031 15:43:28.401000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
W1031 15:43:28.409000 140589415776384 torch/distributed/elastic/rendezvous/dynamic_rendezvous.py:1218] The node 'as07r5b15.bsc.mn_1081727_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1165, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
complete_nlp_example.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-31_15:43:28
host : as07r5b15.bsc.mn
rank : 1 (local_rank: 0)
exitcode : -6 (pid: 1081813)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1081813
========================================================
srun: error: as07r5b12: task 0: Exited with exit code 1
srun: Terminating StepId=11066094.0
```
What's interesting is that the output is duplicated in some parts. I'm starting to think that maybe I need something MPI-related in the container...
Not sure. From what I can see, the root of the error is that it tries to delete this `default_experiment-1-0.arrow` file without finding it? I do see that the corresponding `.lock` file is created. Not sure whether what comes later descends from that or not...
Update: this seems related to https://github.com/huggingface/evaluate/issues/382, where two scripts are trying to compute the same metric. Not sure if I can skip caching, but it's quite interesting that it only happens when I enforce GPUs and not when it falls back to CPUs...
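(A hedged workaround, not verified on this cluster: `evaluate.load()` accepts an `experiment_id` argument, so giving each run a unique id should keep concurrent jobs from contending for the same `default_experiment-*.arrow` cache files.)

```python
import os
import uuid

import evaluate

# Hypothetical workaround: a unique experiment_id per job keeps concurrent runs
# from reading/removing each other's default_experiment-*.arrow cache files.
run_id = os.environ.get("SLURM_JOB_ID", uuid.uuid4().hex)
metric = evaluate.load("glue", "mrpc", experiment_id=f"mrpc-{run_id}")
```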
Further updates: `complete_nlp_example.py` is broken in the sense that the `evaluate.load()` call should include `process_id` and `num_process` so that it realises it is running in a distributed environment (a sketch of that change follows the traceback below). Once that is fixed, the processes die due to this bug: https://github.com/huggingface/evaluate/issues/542 (https://github.com/huggingface/evaluate/issues/481 is related). I have tried different combinations of datasets, evaluate and filelock versions without any success... Are there any fixes for those? Once this gets solved, I could submit a PR with the right SLURM scripts and calls in the `complete_nlp_example` script. Traceback attached here:
```
The following values were not passed to `accelerate launch` and had defaults used instead:
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
The following values were not passed to `accelerate launch` and had defaults used instead:
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[rank1]:[W1102 21:00:57.360673584 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1102 21:00:57.749390097 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]: Traceback (most recent call last):
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank1]: main()
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank1]: training_function(config, args)
[rank1]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 245, in training_function
[rank1]: metric.add_batch(
[rank1]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 510, in add_batch
[rank1]: self._init_writer()
[rank1]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 659, in _init_writer
[rank1]: self._check_rendez_vous() # wait for master to be ready and to let everyone go
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 362, in _check_rendez_vous
[rank1]: raise ValueError(
[rank1]: ValueError: Expected to find locked file /gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-2-0.arrow.lock from process 1 but it doesn't exist.
[rank0]: Traceback (most recent call last):
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 325, in <module>
[rank0]: main()
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 321, in main
[rank0]: training_function(config, args)
[rank0]: File "/gpfs/projects/ehpc63/tests/accelerate_test/complete_nlp_example.py", line 245, in training_function
[rank0]: metric.add_batch(
[rank0]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 510, in add_batch
[rank0]: self._init_writer()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 656, in _init_writer
[rank0]: self._check_all_processes_locks() # wait for everyone to be ready
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/evaluate/module.py", line 350, in _check_all_processes_locks
[rank0]: raise ValueError(
[rank0]: ValueError: Expected to find locked file /gpfs/projects/ehpc63/hf/metrics/glue/mrpc/default_experiment-2-1.arrow.lock from process 0 but it doesn't exist.
E1102 21:03:02.757000 2414310 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2414397) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-02_21:03:02
host : as01r1b03.bsc.mn
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2414397)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
E1102 21:03:02.785000 922486 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 922572) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.12/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
complete_nlp_example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-02_21:03:02
host : as01r1b09.bsc.mn
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 922572)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: as01r1b03: task 0: Exited with exit code 1
srun: Terminating StepId=11137338.0
srun: error: as01r1b09: task 1: Exited with exit code 1
```
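(As mentioned above, a minimal sketch of the `evaluate.load()` change, assuming the metric is created after the `Accelerator` so the process counts are known; a sketch of the idea, not the exact patch.)

```python
import evaluate
from accelerate import Accelerator

accelerator = Accelerator()

# Tell evaluate it is running distributed: each rank writes its own shard and
# only process_id 0 merges them in compute().
metric = evaluate.load(
    "glue",
    "mrpc",
    num_process=accelerator.num_processes,
    process_id=accelerator.process_index,
)
```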
I fixed it by gathering the samples on the main process and computing metrics only on the main process. Can I submit a PR with a working slurm script and the updated example?
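(A rough sketch of that fix, reconstructed from the description above rather than the exact diff; it assumes `accelerator`, `model` and `eval_dataloader` come from the surrounding training function and have already gone through `accelerator.prepare`.)

```python
import evaluate
import torch

# Loaded without per-rank bookkeeping, since only one process touches the metric.
metric = evaluate.load("glue", "mrpc")

model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    # Gather predictions/labels from all ranks onto every process...
    predictions, references = accelerator.gather_for_metrics(
        (predictions, batch["labels"])
    )
    # ...but only the main process feeds the metric, so no cache/lock files clash.
    if accelerator.is_main_process:
        metric.add_batch(predictions=predictions, references=references)

if accelerator.is_main_process:
    eval_metric = metric.compute()
    print(eval_metric)
```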
System Info

Information

Tasks

An officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction

I ran `complete_nlp_example.py` through the `submit_multinode.sh` submit script. At first, the script runs without errors, but parallelises on the CPUs rather than the GPUs. To solve that, I passed the `--multi_gpu` flag to `accelerate launch`, but that would crash with an error related to `args.machine_rank` being of `NoneType`. To get around it, I added `--machine_rank 0` to the launch command. This would run, but after training the first epoch it crashes with the following traceback:

This looks like some communication problem; could it be that I need some MPI in the container (even though it's being launched with srun)? The script can correctly address 8 GPUs over two nodes. The cluster is MareNostrum 5 at BSC, if it matters (the submit scripts seem to fit that cluster, so I thought maybe they were developed there).
Expected behavior
The script runs without errors.