mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[SSD] Training fails due to NCCL timeout #546

Closed: rakshithvasudev closed this issue 2 years ago

rakshithvasudev commented 2 years ago

Hello All,

Thanks for providing the SSD implementation with the ResNeXt backbone. I'm running with a batch size of 16 using the run_and_time.sh script, without Slurm, on a single node. The run hits NCCL issues that look like an all-reduce timeout. Any idea how to resolve this?
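
For context, the Timeout(ms)=1800000 in the watchdog messages below is DDP's default 30-minute collective timeout. To distinguish an unusually slow collective from a genuine hang, the timeout can be raised where the process group is created; the following is a minimal sketch, not the reference script's actual call site:

import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout above the 30-minute default (1800000 ms)
# so a slow-but-progressing collective is not killed by the watchdog.
# This is a debugging aid, not a fix for a real deadlock.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is timedelta(minutes=30)
)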

Running ...
:::MLLOG {"namespace": "", "time_ms": 1650562923867, "event_type": "INTERVAL_START", "key": "epoch_start", "value": 0, "metadata": {"file": "engine.py", "lineno": 15, "epoch_num": 0}}
/opt/conda/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[the same UserWarning is emitted once by each of the 8 worker processes; duplicates omitted]
Epoch: [0]  [   0/9143]  eta: 17:27:10  lr: 0.000000  loss: 2.2542 (2.2542)  bbox_regression: 0.7124 (0.7124)  classification: 1.5418 (1.5418)  time: 6.8720  data: 2.9338  max mem: 26239
Epoch: [0]  [  20/9143]  eta: 2:13:02  lr: 0.000000  loss: 2.2190 (2.2581)  bbox_regression: 0.7114 (0.7129)  classification: 1.5125 (1.5452)  time: 0.5751  data: 0.0003  max mem: 26671
Epoch: [0]  [  40/9143]  eta: 1:50:03  lr: 0.000001  loss: 2.2631 (2.2590)  bbox_regression: 0.7127 (0.7142)  classification: 1.5398 (1.5448)  time: 0.5683  data: 0.0003  max mem: 26673
Epoch: [0]  [  60/9143]  eta: 1:41:34  lr: 0.000001  loss: 2.2726 (2.2627)  bbox_regression: 0.7162 (0.7164)  classification: 1.5411 (1.5463)  time: 0.5593  data: 0.0003  max mem: 26675
Epoch: [0]  [  80/9143]  eta: 1:37:17  lr: 0.000001  loss: 2.2647 (2.2644)  bbox_regression: 0.7154 (0.7158)  classification: 1.5372 (1.5486)  time: 0.5625  data: 0.0003  max mem: 26677
Epoch: [0]  [ 100/9143]  eta: 1:34:33  lr: 0.000001  loss: 2.2604 (2.2628)  bbox_regression: 0.7172 (0.7166)  classification: 1.5318 (1.5462)  time: 0.5595  data: 0.0003  max mem: 26677
Epoch: [0]  [ 120/9143]  eta: 1:33:01  lr: 0.000001  loss: 2.2130 (2.2589)  bbox_regression: 0.7024 (0.7139)  classification: 1.5148 (1.5450)  time: 0.5740  data: 0.0003  max mem: 26710
Epoch: [0]  [ 140/9143]  eta: 1:31:34  lr: 0.000002  loss: 2.1325 (2.2445)  bbox_regression: 0.6886 (0.7101)  classification: 1.4484 (1.5344)  time: 0.5598  data: 0.0003  max mem: 26710
Epoch: [0]  [ 160/9143]  eta: 1:30:27  lr: 0.000002  loss: 2.0146 (2.2180)  bbox_regression: 0.6585 (0.7045)  classification: 1.3607 (1.5136)  time: 0.5614  data: 0.0003  max mem: 26787
Epoch: [0]  [ 180/9143]  eta: 1:29:30  lr: 0.000002  loss: 1.9003 (2.1860)  bbox_regression: 0.6490 (0.6983)  classification: 1.2550 (1.4877)  time: 0.5594  data: 0.0003  max mem: 26787
Epoch: [0]  [ 200/9143]  eta: 1:28:59  lr: 0.000002  loss: 1.8187 (2.1506)  bbox_regression: 0.6311 (0.6921)  classification: 1.2009 (1.4585)  time: 0.5777  data: 0.0003  max mem: 26787
Epoch: [0]  [ 220/9143]  eta: 1:28:15  lr: 0.000003  loss: 1.7182 (2.1125)  bbox_regression: 0.6240 (0.6862)  classification: 1.1075 (1.4262)  time: 0.5568  data: 0.0003  max mem: 26787
Epoch: [0]  [ 240/9143]  eta: 1:27:35  lr: 0.000003  loss: 1.6671 (2.0756)  bbox_regression: 0.6079 (0.6804)  classification: 1.0519 (1.3952)  time: 0.5562  data: 0.0003  max mem: 26787
Epoch: [0]  [ 260/9143]  eta: 1:27:06  lr: 0.000003  loss: 1.6425 (2.0424)  bbox_regression: 0.6106 (0.6751)  classification: 1.0362 (1.3673)  time: 0.5645  data: 0.0003  max mem: 26787
Epoch: [0]  [ 280/9143]  eta: 1:26:37  lr: 0.000003  loss: 1.6008 (2.0109)  bbox_regression: 0.6034 (0.6701)  classification: 0.9903 (1.3407)  time: 0.5619  data: 0.0003  max mem: 26787
Epoch: [0]  [ 300/9143]  eta: 1:26:10  lr: 0.000003  loss: 1.5591 (1.9815)  bbox_regression: 0.5976 (0.6655)  classification: 0.9659 (1.3161)  time: 0.5596  data: 0.0003  max mem: 26787
Epoch: [0]  [ 320/9143]  eta: 1:25:42  lr: 0.000004  loss: 1.5461 (1.9548)  bbox_regression: 0.5874 (0.6608)  classification: 0.9624 (1.2940)  time: 0.5547  data: 0.0003  max mem: 26787
Epoch: [0]  [ 340/9143]  eta: 1:25:18  lr: 0.000004  loss: 1.4913 (1.9283)  bbox_regression: 0.5774 (0.6558)  classification: 0.9294 (1.2725)  time: 0.5588  data: 0.0003  max mem: 26787
Epoch: [0]  [ 360/9143]  eta: 1:24:53  lr: 0.000004  loss: 1.4813 (1.9036)  bbox_regression: 0.5763 (0.6512)  classification: 0.9040 (1.2524)  time: 0.5554  data: 0.0003  max mem: 26787
Epoch: [0]  [ 380/9143]  eta: 1:24:35  lr: 0.000004  loss: 1.4774 (1.8811)  bbox_regression: 0.5606 (0.6467)  classification: 0.9116 (1.2345)  time: 0.5646  data: 0.0003  max mem: 26890
Epoch: [0]  [ 400/9143]  eta: 1:24:12  lr: 0.000004  loss: 1.4581 (1.8596)  bbox_regression: 0.5539 (0.6422)  classification: 0.8923 (1.2174)  time: 0.5531  data: 0.0003  max mem: 26890
Epoch: [0]  [ 420/9143]  eta: 1:23:51  lr: 0.000005  loss: 1.4297 (1.8392)  bbox_regression: 0.5495 (0.6378)  classification: 0.8829 (1.2014)  time: 0.5560  data: 0.0003  max mem: 26890
Epoch: [0]  [ 440/9143]  eta: 1:23:31  lr: 0.000005  loss: 1.4216 (1.8201)  bbox_regression: 0.5464 (0.6336)  classification: 0.8709 (1.1865)  time: 0.5556  data: 0.0002  max mem: 26890
Epoch: [0]  [ 460/9143]  eta: 1:23:13  lr: 0.000005  loss: 1.4190 (1.8028)  bbox_regression: 0.5431 (0.6300)  classification: 0.8715 (1.1728)  time: 0.5582  data: 0.0003  max mem: 26890
Epoch: [0]  [ 480/9143]  eta: 1:22:54  lr: 0.000005  loss: 1.4042 (1.7863)  bbox_regression: 0.5432 (0.6264)  classification: 0.8590 (1.1598)  time: 0.5552  data: 0.0003  max mem: 26890
Epoch: [0]  [ 500/9143]  eta: 1:22:38  lr: 0.000006  loss: 1.3647 (1.7697)  bbox_regression: 0.5295 (0.6226)  classification: 0.8351 (1.1471)  time: 0.5586  data: 0.0003  max mem: 26890
Epoch: [0]  [ 520/9143]  eta: 1:22:20  lr: 0.000006  loss: 1.3576 (1.7541)  bbox_regression: 0.5250 (0.6190)  classification: 0.8350 (1.1351)  time: 0.5540  data: 0.0003  max mem: 26890
Epoch: [0]  [ 540/9143]  eta: 1:22:03  lr: 0.000006  loss: 1.3429 (1.7389)  bbox_regression: 0.5236 (0.6155)  classification: 0.8139 (1.1234)  time: 0.5568  data: 0.0003  max mem: 26890
Epoch: [0]  [ 560/9143]  eta: 1:21:46  lr: 0.000006  loss: 1.3481 (1.7249)  bbox_regression: 0.5270 (0.6123)  classification: 0.8199 (1.1126)  time: 0.5554  data: 0.0003  max mem: 26890
Epoch: [0]  [ 580/9143]  eta: 1:21:30  lr: 0.000006  loss: 1.3164 (1.7110)  bbox_regression: 0.5223 (0.6091)  classification: 0.8045 (1.1019)  time: 0.5542  data: 0.0003  max mem: 26890
Epoch: [0]  [ 600/9143]  eta: 1:21:15  lr: 0.000007  loss: 1.3029 (1.6976)  bbox_regression: 0.5122 (0.6059)  classification: 0.7838 (1.0917)  time: 0.5596  data: 0.0003  max mem: 26890
Epoch: [0]  [ 620/9143]  eta: 1:21:00  lr: 0.000007  loss: 1.2984 (1.6848)  bbox_regression: 0.5059 (0.6027)  classification: 0.7921 (1.0821)  time: 0.5576  data: 0.0003  max mem: 26890
Epoch: [0]  [ 640/9143]  eta: 1:20:45  lr: 0.000007  loss: 1.2836 (1.6724)  bbox_regression: 0.5107 (0.5998)  classification: 0.7715 (1.0725)  time: 0.5577  data: 0.0003  max mem: 26890
Epoch: [0]  [ 660/9143]  eta: 1:20:31  lr: 0.000007  loss: 1.2806 (1.6606)  bbox_regression: 0.5081 (0.5971)  classification: 0.7773 (1.0635)  time: 0.5590  data: 0.0003  max mem: 26890
Epoch: [0]  [ 680/9143]  eta: 1:20:16  lr: 0.000008  loss: 1.2850 (1.6564)  bbox_regression: 0.5167 (0.5946)  classification: 0.7727 (1.0618)  time: 0.5563  data: 0.0003  max mem: 26890
Epoch: [0]  [ 700/9143]  eta: 1:20:02  lr: 0.000008  loss: 1.2730 (1.6453)  bbox_regression: 0.5039 (0.5919)  classification: 0.7650 (1.0534)  time: 0.5543  data: 0.0003  max mem: 26890
Epoch: [0]  [ 720/9143]  eta: 1:19:48  lr: 0.000008  loss: 1.2712 (1.6349)  bbox_regression: 0.4970 (0.5894)  classification: 0.7661 (1.0454)  time: 0.5609  data: 0.0003  max mem: 26890
Epoch: [0]  [ 740/9143]  eta: 1:19:34  lr: 0.000008  loss: 1.2646 (1.6247)  bbox_regression: 0.5001 (0.5870)  classification: 0.7610 (1.0377)  time: 0.5570  data: 0.0003  max mem: 26890
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808094 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808974 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809147 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809144 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809146 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809149 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809150 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809152 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 2.  Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808094 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809144 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 4.  Original reason for failure was: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809150 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 7.  Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809147 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
   self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 7.  Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809147 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    main(args)
  File "train.py", line 235, in main
        train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)self._sync_params()

  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809152 milliseconds before timing out.
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809146 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::runtime_errorterminate called after throwing an instance of 'terminate called after throwing an instance of 'std::runtime_errorterminate called after throwing an instance of 'std::runtime_error'
std::runtime_errorstd::runtime_error'
std::runtime_error'
'
'
'
  what():    what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809144 milliseconds before timing out.[Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809150 milliseconds before timing out.
  what():    what():
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809147 milliseconds before timing out.[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809146 milliseconds before timing out.[Rank 2] Watchdog caught collective operation timeout: WorkNCCL(
OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808094 milliseconds before timing out.
  what():

[Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809152 milliseconds before timing out.
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 235, in main
    train_one_epoch(model, optimizer, scaler, data_loader, device, epoch, args)
  File "/workspace/single_stage_detector/ssd/engine.py", line 35, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1382, in _sync_params
    authoritative_rank,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1335, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: NCCL communicator was aborted on rank 5.  Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808974 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808974 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 609 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 610 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 612 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 613 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 614 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 607) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
  )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
train.py FAILED
----------------------------------------------------
Failures:
[1]:
  time      : 2022-04-21_18:19:47
  host      : be87a6f1b062
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 608)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 608
[2]:
  time      : 2022-04-21_18:19:47
  host      : be87a6f1b062
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 611)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 611
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-21_18:19:47
  host      : be87a6f1b062
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 607)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 607
====================================================

Thanks in advance!

ahmadki commented 2 years ago

Hi @rakshithvasudev

Can you provide more information about the system you are using?

Can you also check whether the error happens with NGC containers (for example nvcr.io/nvidia/pytorch:22.03-py3)?

Thank you

rakshithvasudev commented 2 years ago

Hello @ahmadki

Thanks for your response. Here are answers to your questions:

My understanding is that, for my host driver version, nvcr.io/nvidia/pytorch:21.10-py3 is a compatible NGC container that doesn't rely on backward compatibility. So I tried the image given in the Dockerfile, the image you mentioned above, and the image I deemed compatible. They are as follows:

1. pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel
2. nvcr.io/nvidia/pytorch:22.03-py3
3. nvcr.io/nvidia/pytorch:21.10-py3

I'm able to reproduce the same error with all three Docker base images listed above. Any help would be appreciated.
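
For reference, when comparing images it can help to record the stack versions inside each container; a small hypothetical snippet (not part of the reference code):

import torch

# Print the versions bundled in the container (all standard torch APIs).
print("torch :", torch.__version__)
print("CUDA  :", torch.version.cuda)
print("cuDNN :", torch.backends.cudnn.version())
print("NCCL  :", torch.cuda.nccl.version())
print("GPUs  :", torch.cuda.device_count())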

rakshithvasudev commented 2 years ago

Furthermore, I ran nccl-tests and didn't see any errors.
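
As a complementary check at the PyTorch level (nccl-tests exercises NCCL directly, not through torch.distributed), a minimal all-reduce script can be launched with torchrun inside the same container. This is a hypothetical standalone check, not part of the benchmark:

# allreduce_check.py (hypothetical) -- launch with:
#   torchrun --nproc_per_node=8 allreduce_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(x)  # sum across all ranks
expected = float(sum(range(dist.get_world_size())))
assert torch.allclose(x, torch.full_like(x, expected))
print(f"rank {dist.get_rank()}: all_reduce OK")

dist.destroy_process_group()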

ahmadki commented 2 years ago

Thank you for providing the information. Using NGC containers helps us narrow down the issue, so let's keep using them.

First, please make sure you are using the --ipc=host flag in your docker run command:

docker run --rm -it \
  --gpus=all \
  --ipc=host \
  -v <HOST_DATA_DIR>:/datasets/open-images-v6-mlperf \
  mlperf/single_stage_detector bash

This ensures the processes inside your container have enough shared memory for communication. You can run df -h | grep shm from inside the container to confirm there is enough shared memory.
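
If df is not available in the image, a hypothetical Python equivalent run inside the container gives the same information; a small /dev/shm (Docker's 64 MB default) is a common cause of stalled DataLoader workers:

import os

# Report total and free shared memory, the Python equivalent of `df -h /dev/shm`.
st = os.statvfs("/dev/shm")
gib = 2 ** 30
print(f"/dev/shm total: {st.f_frsize * st.f_blocks / gib:.1f} GiB, "
      f"free: {st.f_frsize * st.f_bavail / gib:.1f} GiB")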

If the above flag is already in use, please set the following debug environment variables and attach the logs:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
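
If editing the launch environment is inconvenient, the same switches can be set at the top of train.py before torch is imported (hypothetical placement; the effect is the same as the exports above):

import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
os.environ.setdefault("TORCH_SHOW_CPP_STACKTRACES", "1")

import torch  # the rest of train.py's imports follow; the variables are set before any distributed init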

rakshithvasudev commented 2 years ago

Thanks @ahmadki!

I was able to run to completion. I actually had --ipc=host from the start, so I'm not sure what the issue was. I exited my container and relaunched everything: it failed the first time, but on the second attempt it ran to completion.

Since my objective was just to verify that SSD training on the OpenImages dataset works, I'm closing this issue.

For those curious, the container I was running was nvcr.io/nvidia/pytorch:21.10-py3.

If you run into a similar issue, try launching again and see if that helps.

Epoch: [4]  [9040/9143]  eta: 0:01:00  lr: 0.000100  loss: 0.5669 (0.5804)  bbox_regression: 0.2549 (0.2566)  classification: 0.3171 (0.3238)  time: 0.5812  data: 0.0002  max mem: 27072
Epoch: [4]  [9060/9143]  eta: 0:00:48  lr: 0.000100  loss: 0.5765 (0.5804)  bbox_regression: 0.2603 (0.2566)  classification: 0.3166 (0.3238)  time: 0.5871  data: 0.0002  max mem: 27072
Epoch: [4]  [9080/9143]  eta: 0:00:36  lr: 0.000100  loss: 0.5845 (0.5804)  bbox_regression: 0.2584 (0.2566)  classification: 0.3273 (0.3238)  time: 0.5851  data: 0.0002  max mem: 27072
Epoch: [4]  [9100/9143]  eta: 0:00:25  lr: 0.000100  loss: 0.5873 (0.5804)  bbox_regression: 0.2580 (0.2566)  classification: 0.3248 (0.3238)  time: 0.5863  data: 0.0002  max mem: 27072
Epoch: [4]  [9120/9143]  eta: 0:00:13  lr: 0.000100  loss: 0.5802 (0.5805)  bbox_regression: 0.2578 (0.2566)  classification: 0.3307 (0.3239)  time: 0.5838  data: 0.0002  max mem: 27072
Epoch: [4]  [9140/9143]  eta: 0:00:01  lr: 0.000100  loss: 0.5604 (0.5804)  bbox_regression: 0.2499 (0.2566)  classification: 0.3144 (0.3238)  time: 0.5817  data: 0.0009  max mem: 27072
Epoch: [4]  [9142/9143]  eta: 0:00:00  lr: 0.000100  loss: 0.5603 (0.5804)  bbox_regression: 0.2461 (0.2566)  classification: 0.3144 (0.3238)  time: 0.5813  data: 0.0009  max mem: 27072
Epoch: [4] Total time: 1:29:09 (0.5851 s / it)
:::MLLOG {"namespace": "", "time_ms": 1651026630084, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": 4, "metadata": {"file": "engine.py", "lineno": 60, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651026631099, "event_type": "INTERVAL_START", "key": "eval_start", "value": 4, "metadata": {"file": "engine.py", "lineno": 66, "epoch_num": 4}}
Test:  [  0/194]  eta: 0:13:48  model_time: 0.5314 (0.5314)  evaluator_time: 0.2267 (0.2267)  time: 4.2701  data: 3.4947  max mem: 27072
Test:  [ 20/194]  eta: 0:02:04  model_time: 0.3035 (0.3190)  evaluator_time: 0.2120 (0.2155)  time: 0.5385  data: 0.0002  max mem: 27072
Test:  [ 40/194]  eta: 0:01:36  model_time: 0.2878 (0.3081)  evaluator_time: 0.1955 (0.2173)  time: 0.5298  data: 0.0002  max mem: 27072
Test:  [ 60/194]  eta: 0:01:18  model_time: 0.2921 (0.3029)  evaluator_time: 0.1959 (0.2125)  time: 0.5082  data: 0.0002  max mem: 27072
Test:  [ 80/194]  eta: 0:01:04  model_time: 0.2910 (0.3013)  evaluator_time: 0.1930 (0.2081)  time: 0.5056  data: 0.0002  max mem: 27072
Test:  [100/194]  eta: 0:00:52  model_time: 0.2950 (0.2997)  evaluator_time: 0.2075 (0.2067)  time: 0.5075  data: 0.0002  max mem: 27072
Test:  [120/194]  eta: 0:00:40  model_time: 0.2902 (0.2992)  evaluator_time: 0.1863 (0.2044)  time: 0.5053  data: 0.0002  max mem: 27072
Test:  [140/194]  eta: 0:00:29  model_time: 0.2890 (0.2978)  evaluator_time: 0.1955 (0.2039)  time: 0.5036  data: 0.0002  max mem: 27072
Test:  [160/194]  eta: 0:00:18  model_time: 0.2915 (0.2972)  evaluator_time: 0.1988 (0.2037)  time: 0.5082  data: 0.0002  max mem: 27072
Test:  [180/194]  eta: 0:00:07  model_time: 0.2921 (0.2970)  evaluator_time: 0.2024 (0.2049)  time: 0.5243  data: 0.0002  max mem: 27072
Test:  [193/194]  eta: 0:00:00  model_time: 0.2885 (0.2958)  evaluator_time: 0.1834 (0.2030)  time: 0.4867  data: 0.0002  max mem: 27072
Test: Total time: 0:01:43 (0.5343 s / it)
Averaged stats: model_time: 0.2885 (0.2957)  evaluator_time: 0.1834 (0.2044)
Accumulating evaluation results...
DONE (t=92.93s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.34282
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.48390
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.36762
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.02334
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.11563
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.38066
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.40912
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.58227
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.60925
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.07781
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.29512
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.66112

:::MLLOG {"namespace": "", "time_ms": 1651027002780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34282287555785546, "metadata": {"file": "engine.py", "lineno": 108, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651027010611, "event_type": "INTERVAL_END", "key": "eval_stop", "value": 4, "metadata": {"file": "engine.py", "lineno": 109, "epoch_num": 4}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"file": "train.py", "lineno": 257, "status": "success"}}
Training time 8:00:35
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020658, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
:::MLLOG {"namespace": "", "time_ms": 1651027020657, "event_type": "POINT_IN_TIME", "key": "status", "value": "success", "metadata": {"file": "train.py", "lineno": 261}}
ENDING TIMING RUN AT 2022-04-27 02:37:32 AM
RESULT,SINGLE_STAGE_DETECTOR,,28890,nvidia,2022-04-26 06:36:02 PM