Timeout error during training to reproduce results

Hi,

I have been trying to train the model for Imagenet dataset using 16 V100 GPUs. I am getting a timeout error during evaluation in the training script after the first epoch. It occurs exactly at the same point [2100/11010] iteration in evaluation. Any idea as to why this is occurring?

STACK TRACE:

Test: [ 2090/10010] eta: 1:53:32 acc1: 73.3203 (76.6279) ema_acc1: 77.7734 (76.4956) time: 0.8568 data: 0.0
001 max mem: 14135 [E ProcessGroupNCCL.cpp:587] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800172 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801901 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801862 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801679 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802073 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801667 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801821 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802399 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1802703 milliseconds before timing out.
Test: [ 2100/10010] eta: 1:53:23 acc1: 79.5312 (76.6506) ema_acc1: 81.9922 (76.5298) time: 0.8566 data: 0.0
001 max mem: 14135
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1549 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1562 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1563 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1564 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1565 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1566 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1570 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 12 (pid: 1567) of binary: /
nfs/users/ext_jameel.hassan/anaconda3/envs/must/bin/python
Traceback (most recent call last): File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 194, in _run_module_as_mai n
return _run_code(code, main_globals, None,
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", l
ine 723, in
main()
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/elastic/mu
ltiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", l
ine 719, in main
run(args)
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/run.py", l
ine 710, in run
elastic_launch( File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/nfs/users/ext_jameel.hassan/anaconda3/envs/must/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(torch.distributed.elastic.multiprocessing.errors.ChildFailedError: =====================================================
train.py FAILED

Failures:
[1]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 15 (local_rank: 15)
exitcode : -6 (pid: 1575)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575

Root Cause (first observed failure):
[0]:
time : 2023-01-02_15:24:36
host : p4-r66-a.g42cloud.net
rank : 12 (local_rank: 12)
exitcode : -6 (pid: 1567)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1567
=====================================================

salesforce / MUST

Timeout error during training to reproduce results #7