Icecream-blue-sky opened this issue 2 years ago
Here's more detailed information. This is what happens when I set time.sleep(2) after training one epoch:
StitchEpochBasedRunner.py
def train(self, data_loader, **kwargs):
    self.model.train()
    self.mode = 'train'
    self.data_loader_regular = data_loader[0]
    data_loader_regular_iter = iter(self.data_loader_regular)
    self.data_loader_unregular = data_loader[1]
    data_loader_unregular_iter = iter(self.data_loader_unregular)
    self.data_loader = self.data_loader_regular
    self._max_iters = self._max_epochs * len(self.data_loader)
    self.call_hook('before_train_epoch')
    time.sleep(2)  # Prevent possible deadlock during epoch transition
    # under dist training, len(data_batch) = samples_per_gpu * num_gpus
    if not hasattr(self, "ratio_small"):
        self.ratio_small = 0.0
    for i in range(len(self.data_loader)):
        if self.ratio_small < self.stitch_thresh:
            # use stitched input
            data_batch = next(data_loader_unregular_iter)
        else:
            # use regular input
            data_batch = next(data_loader_regular_iter)
        self._inner_iter = i
        self.call_hook('before_train_iter')
        self.run_iter(data_batch, train_mode=True, **kwargs)
        self.call_hook('after_train_iter')
        self._iter += 1
    # sleep here before the epoch-end hooks
    time.sleep(2)
    self.call_hook('after_train_epoch')
    self._epoch += 1
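For context, the call site in train_net.py looks roughly like the sketch below; the builder calls and dataset variables are placeholders for illustration, not the exact code. The point is only that runner.run() receives a pair of loaders for the train flow, which train() then unpacks as data_loader[0] and data_loader[1].

# Sketch of the call site (placeholder names, not the actual train_net.py)
from mmdet.datasets import build_dataloader

regular_loader = build_dataloader(regular_dataset, cfg.data.samples_per_gpu,
                                  cfg.data.workers_per_gpu, dist=True)
unregular_loader = build_dataloader(stitched_dataset, cfg.data.samples_per_gpu,
                                    cfg.data.workers_per_gpu, dist=True)
runner.run([[regular_loader, unregular_loader]], cfg.workflow)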
With this change I can also save checkpoints, but the error is different:
2022-07-19 11:09:27,724 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/8331, elapsed: 0s, ETA:/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/datasets/pipelines/formating.py:7: UserWarnin
[E ProcessGroupNCCL.cpp:566] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809763 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809779 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809527 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809976 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809989 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809975 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809984 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809779 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809763 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809527 milliseconds before timing out.
Traceback (most recent call last):
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809976 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
feat, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
outs = self.forward(feats)
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809989 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
feat, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
outs = self.forward(feats)
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
self.strides)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 129, in forward_single
bbox_pred = F.relu(bbox_pred)
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1298, in relu
result = torch.relu(input)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "stitcher/train.py", line 194, in <module>
main()
File "stitcher/train.py", line 190, in main
meta=meta)
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/train_net.py", line 401, in train_detector
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809975 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
feat, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
outs = self.forward(feats)
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
self.strides)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 129, in forward_single
bbox_pred = F.relu(bbox_pred)
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1298, in relu
result = torch.relu(input)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "stitcher/train.py", line 194, in <module>
main()
File "stitcher/train.py", line 190, in main
meta=meta)
File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/train_net.py", line 401, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/stitch_epoch_based_runner.py", line 149, in run
# self.train or self.val
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/stitch_epoch_based_runner.py", line 74, in train
time.sleep(2)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/evaluation/eval_hooks.py", line 121, in _do_evaluate
gpu_collect=self.gpu_collect)
File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/apis/test.py", line 135, in multi_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 27630) is killed by signal: Interrupt.
Anybody?
I customized EpochBasedRunner to feed different pictures under different control conditions. This is my change (the train() method shown above):
As you can see, I define two separate dataloaders to feed different images. Training works normally on a single GPU, but on multiple GPUs it gets stuck at the first epoch. When one epoch ends, the two dataloaders will not necessarily have been fully iterated, and I don't know whether that is the cause of my problem.
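If the cause is that different ranks diverge on which loader they draw from (so one rank exhausts its iterator or falls out of step at a collective), one thing I could try is to let rank 0 decide the branch and broadcast that decision. This is only a sketch and assumes ratio_small is computed per rank and torch.distributed is already initialized:

import torch
import torch.distributed as dist

# Sketch only: force every rank to take the same branch in each iteration, so
# no rank exhausts its stitched-loader iterator earlier than the others.
use_stitch = torch.tensor(
    [int(self.ratio_small < self.stitch_thresh)], device='cuda')
if dist.is_available() and dist.is_initialized():
    dist.broadcast(use_stitch, src=0)  # rank 0's decision wins on all ranks
if use_stitch.item():
    data_batch = next(data_loader_unregular_iter)
else:
    data_batch = next(data_loader_regular_iter)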
I tried to find where it gets stuck, so I set ipdb.set_trace() in CheckpointHook:
Surprisingly, the checkpoint for the first epoch can now be saved successfully (it was not possible before). However, training gets stuck again after that. Any advice on this problem?
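One diagnostic I may try next (only a guess, based on the traceback pointing into EvalHook and multi_gpu_test) is to push the evaluation interval past the number of epochs, so after_train_epoch never launches distributed evaluation and I can see whether the hang is in evaluation rather than in the training loop:

# Hypothetical config tweak for diagnosis only, not a fix: with a large
# interval the EvalHook skips multi_gpu_test after each epoch.
evaluation = dict(interval=100, metric='bbox')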