Hi, I'm hitting this error during a multi-GPU training session. What could be the cause?
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=673327, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800164 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=673327, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800164 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64733 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64736 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64738 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 64737) of binary: /opt/conda/envs/pillarnext_env/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/pillarnext_env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/pillarnext_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/pillarnext_env/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/envs/pillarnext_env/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/envs/pillarnext_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/pillarnext_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
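For reference, Timeout(ms)=1800000 in the log is the default 30-minute NCCL collective timeout: rank 2's BROADCAST waited that long for the other ranks, which usually means one rank hung or fell out of sync (e.g. uneven data sharding, a stuck data loader, or a crashed peer). A minimal sketch of how the timeout can be raised at initialization, assuming the script uses torch.distributed under torchrun (the function name init_distributed is mine, not from the original script):

```python
from datetime import timedelta

# Assumption: the default ProcessGroupNCCL timeout is 30 minutes,
# matching Timeout(ms)=1800000 in the log above.
DEFAULT_NCCL_TIMEOUT = timedelta(milliseconds=1_800_000)
RAISED_NCCL_TIMEOUT = timedelta(hours=2)  # example value, not a recommendation

def init_distributed(timeout: timedelta = RAISED_NCCL_TIMEOUT) -> None:
    """Initialize the NCCL process group with a longer collective timeout.

    torchrun populates RANK, WORLD_SIZE, and LOCAL_RANK in the environment,
    so init_process_group can pick them up with the env:// rendezvous.
    """
    import torch.distributed as dist  # local import; requires PyTorch with CUDA

    dist.init_process_group(backend="nccl", timeout=timeout)
```

Note that raising the timeout only buys time; if one rank is genuinely stuck, the collective will still eventually fail, so it is worth also running with NCCL_DEBUG=INFO to see which rank stops participating.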