open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Multi-GPU training gets stuck at the first epoch #8379

Open Icecream-blue-sky opened 2 years ago

Icecream-blue-sky commented 2 years ago

I customized EpochBasedRunner to feed different images under different control conditions. This is my change:

StitchEpochBasedRunner.py
    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader_regular = data_loader[0]
        data_loader_regular_iter = iter(self.data_loader_regular)
        self.data_loader_unregular = data_loader[1]
        data_loader_unregular_iter = iter(self.data_loader_unregular)
        self.data_loader = self.data_loader_regular
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        # under dist training, len(data_batch) = samples_per_gpu * num_gpus
        if not hasattr(self, "ratio_small"):
            self.ratio_small = 0.0
        for i in range(len(self.data_loader)):
            if self.ratio_small < self.stitch_thresh:
                # use stitch input
                data_batch = next(data_loader_unregular_iter)
            else:
                data_batch = next(data_loader_regular_iter)
            self._inner_iter = i
            self.call_hook('before_train_iter')
            self.run_iter(data_batch, train_mode=True, **kwargs)
            self.call_hook('after_train_iter')
            self._iter += 1

        self.call_hook('after_train_epoch')
        self._epoch += 1

As you can see, I define two separate dataloaders to feed different images. I can train normally on a single GPU, but when I train on multiple GPUs, it gets stuck at the first epoch. When one epoch ends, the two dataloaders will not necessarily have been fully iterated over; I don't know whether this is the cause of my problem.
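
For reference, here is roughly what I suspect: if ratio_small (or stitch_thresh) ever differs between ranks, different ranks take different branches and feed DDP with batches from different loaders, so one rank can sit in a collective that the others never enter. Below is a minimal sketch of how the decision could be synchronized, assuming the default process group has been initialized by the launcher; the helper name is hypothetical and this is not my actual code:

import torch
import torch.distributed as dist


def pick_batch(use_stitch, regular_iter, unregular_iter, device):
    """Hypothetical helper: make every rank take the same branch.

    ``use_stitch`` is the local decision (ratio_small < stitch_thresh).
    Broadcasting rank 0's flag means all ranks run the same number of
    forward/backward passes and hence the same DDP collectives.
    """
    flag = torch.tensor([int(use_stitch)], device=device)
    if dist.is_available() and dist.is_initialized():
        dist.broadcast(flag, src=0)  # all ranks follow rank 0's choice
    if flag.item():
        return next(unregular_iter)  # stitched input
    return next(regular_iter)        # regular input

For completeness, here is the hook registration order and the training log up to the point where it hangs: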

2022-07-19 10:34:59,355 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook                  
(NORMAL      ) StitchDistSamplerSeedHook          
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(LOW         ) DistEvalHook                       
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_epoch:
(NORMAL      ) StitchDistSamplerSeedHook          
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2022-07-19 10:34:59,356 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2022-07-19 10:34:59,356 - mmdet - INFO - Checkpoints will be saved to 
2022-07-19 10:39:34,734 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2022-07-19 10:39:40,640 - mmdet - INFO - Epoch [1][10/173]      lr: 9.991e-05, eta: 0:33:32, time: 0.974, data_time: 0.366, memory: 3743, loss_cls_pos: 1.2167, loss_loc: 1.0404, loss_cls_neg: 0.2963, loss: 2.5535
2022-07-19 10:39:46,590 - mmdet - INFO - Epoch [1][20/173]      lr: 1.998e-04, eta: 0:26:54, time: 0.596, data_time: 0.057, memory: 3765, loss_cls_pos: 1.4346, loss_loc: 1.0611, loss_cls_neg: 0.0309, loss: 2.5265
2022-07-19 10:39:52,411 - mmdet - INFO - Epoch [1][30/173]      lr: 2.997e-04, eta: 0:24:28, time: 0.583, data_time: 0.059, memory: 3765, loss_cls_pos: 1.2396, loss_loc: 1.0163, loss_cls_neg: 0.1322, loss: 2.3881
2022-07-19 10:39:58,263 - mmdet - INFO - Epoch [1][40/173]      lr: 3.996e-04, eta: 0:23:15, time: 0.589, data_time: 0.049, memory: 3765, loss_cls_pos: 1.2351, loss_loc: 0.9365, loss_cls_neg: 0.0640, loss: 2.2355
2022-07-19 10:40:03,726 - mmdet - INFO - Epoch [1][50/173]      lr: 4.995e-04, eta: 0:22:12, time: 0.546, data_time: 0.059, memory: 3795, loss_cls_pos: 1.2171, loss_loc: 0.9125, loss_cls_neg: 0.0981, loss: 2.2277
2022-07-19 10:40:09,842 - mmdet - INFO - Epoch [1][60/173]      lr: 5.994e-04, eta: 0:21:50, time: 0.611, data_time: 0.047, memory: 3795, loss_cls_pos: 1.1973, loss_loc: 0.8528, loss_cls_neg: 0.0869, loss: 2.1370
2022-07-19 10:40:15,588 - mmdet - INFO - Epoch [1][70/173]      lr: 6.993e-04, eta: 0:21:22, time: 0.575, data_time: 0.044, memory: 3795, loss_cls_pos: 1.1586, loss_loc: 0.7953, loss_cls_neg: 0.0840, loss: 2.0379
2022-07-19 10:40:20,915 - mmdet - INFO - Epoch [1][80/173]      lr: 7.992e-04, eta: 0:20:49, time: 0.532, data_time: 0.054, memory: 3795, loss_cls_pos: 1.1260, loss_loc: 0.8029, loss_cls_neg: 0.0772, loss: 2.0061
2022-07-19 10:40:26,728 - mmdet - INFO - Epoch [1][90/173]      lr: 8.991e-04, eta: 0:20:32, time: 0.580, data_time: 0.044, memory: 3872, loss_cls_pos: 1.1049, loss_loc: 0.8055, loss_cls_neg: 0.0878, loss: 1.9982
2022-07-19 10:40:32,350 - mmdet - INFO - Epoch [1][100/173]     lr: 9.990e-04, eta: 0:20:15, time: 0.563, data_time: 0.044, memory: 3872, loss_cls_pos: 1.0540, loss_loc: 0.7553, loss_cls_neg: 0.0880, loss: 1.8973
2022-07-19 10:40:38,558 - mmdet - INFO - Epoch [1][110/173]     lr: 1.099e-03, eta: 0:20:09, time: 0.616, data_time: 0.051, memory: 3872, loss_cls_pos: 1.0888, loss_loc: 0.7567, loss_cls_neg: 0.0703, loss: 1.9158
2022-07-19 10:40:43,759 - mmdet - INFO - Epoch [1][120/173]     lr: 1.199e-03, eta: 0:19:46, time: 0.511, data_time: 0.045, memory: 3872, loss_cls_pos: 1.0190, loss_loc: 0.7143, loss_cls_neg: 0.0795, loss: 1.8128
2022-07-19 10:40:50,363 - mmdet - INFO - Epoch [1][130/173]     lr: 1.299e-03, eta: 0:19:50, time: 0.674, data_time: 0.064, memory: 3872, loss_cls_pos: 0.9927, loss_loc: 0.6856, loss_cls_neg: 0.0855, loss: 1.7638
2022-07-19 10:40:55,980 - mmdet - INFO - Epoch [1][140/173]     lr: 1.399e-03, eta: 0:19:35, time: 0.548, data_time: 0.035, memory: 3872, loss_cls_pos: 0.9711, loss_loc: 0.6598, loss_cls_neg: 0.0725, loss: 1.7033
2022-07-19 10:41:02,187 - mmdet - INFO - Epoch [1][150/173]     lr: 1.499e-03, eta: 0:19:31, time: 0.626, data_time: 0.074, memory: 3872, loss_cls_pos: 0.9494, loss_loc: 0.6867, loss_cls_neg: 0.0766, loss: 1.7127
2022-07-19 10:41:07,718 - mmdet - INFO - Epoch [1][160/173]     lr: 1.598e-03, eta: 0:19:20, time: 0.563, data_time: 0.057, memory: 3872, loss_cls_pos: 0.9103, loss_loc: 0.6144, loss_cls_neg: 0.0717, loss: 1.5965
2022-07-19 10:41:13,126 - mmdet - INFO - Epoch [1][170/173]     lr: 1.698e-03, eta: 0:19:06, time: 0.541, data_time: 0.060, memory: 3872, loss_cls_pos: 0.9272, loss_loc: 0.6205, loss_cls_neg: 0.0771, loss: 1.6248

I tried to find where it gets stuck, so I set ipdb.set_trace() in CheckpointHook:

    def after_train_epoch(self, runner):
        if not self.by_epoch:
            return

        # save checkpoint for following cases:
        # 1. every ``self.interval`` epochs
        # 2. reach the last epoch of training
        if self.every_n_epochs(
                runner, self.interval) or (self.save_last
                                           and self.is_last_epoch(runner)):
            runner.logger.info(
                f'Saving checkpoint at {runner.epoch + 1} epochs')
            if self.sync_buffer:
                allreduce_params(runner.model.buffers())
            self._save_checkpoint(runner)

    @master_only
    def _save_checkpoint(self, runner):
        """Save the current checkpoint and delete unwanted checkpoint."""
        runner.save_checkpoint(
            self.out_dir, save_optimizer=self.save_optimizer, **self.args)
        if runner.meta is not None:
            if self.by_epoch:
                cur_ckpt_filename = self.args.get(
                    'filename_tmpl', 'epoch_{}.pth').format(runner.epoch + 1)
            else:
                cur_ckpt_filename = self.args.get(
                    'filename_tmpl', 'iter_{}.pth').format(runner.iter + 1)
            runner.meta.setdefault('hook_msgs', dict())
            runner.meta['hook_msgs']['last_ckpt'] = self.file_client.join_path(
                self.out_dir, cur_ckpt_filename)

        #  stop here
        ipdb.set_trace()
        # remove other checkpoints
        if self.max_keep_ckpts > 0:
            if self.by_epoch:
                name = 'epoch_{}.pth'
                current_ckpt = runner.epoch + 1
            else:
                name = 'iter_{}.pth'
                current_ckpt = runner.iter + 1
            redundant_ckpts = range(
                current_ckpt - self.max_keep_ckpts * self.interval, 0,
                -self.interval)
            filename_tmpl = self.args.get('filename_tmpl', name)
            for _step in redundant_ckpts:
                ckpt_path = self.file_client.join_path(
                    self.out_dir, filename_tmpl.format(_step))
                if self.file_client.isfile(ckpt_path):
                    self.file_client.remove(ckpt_path)
                else:
                    break

Surprisingly, the checkpoint for the first epoch can be saved successfully (it was not possible before). However, it gets stuck again after this. Any advice on this problem?
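
While waiting for advice, one thing I plan to try is dumping every rank's Python stack instead of stepping with ipdb, which only works interactively on one process. This is a minimal sketch using only the standard library, assuming it is placed near the top of train.py and that the launcher sets LOCAL_RANK:

import faulthandler
import os
import signal

# Write each rank's stack traces to its own file. While the job hangs,
# `kill -USR1 <pid>` on every worker shows exactly which collective or
# dataloader call each rank is blocked in.
rank = os.environ.get('LOCAL_RANK', '0')
trace_file = open(f'rank{rank}_stacks.log', 'w')
faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)

# Also dump automatically if the process is still alive after 30 minutes.
faulthandler.dump_traceback_later(1800, repeat=False, file=trace_file)

Comparing the per-rank stacks should show whether some ranks are still in the training loop while others have already entered the checkpoint or evaluation hooks.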

Icecream-blue-sky commented 2 years ago

Here's more detailed information. When I add time.sleep(2) after training one epoch:

StitchEpochBasedRunner.py
    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader_regular = data_loader[0]
        data_loader_regular_iter = iter(self.data_loader_regular)
        self.data_loader_unregular = data_loader[1]
        data_loader_unregular_iter = iter(self.data_loader_unregular)
        self.data_loader = self.data_loader_regular
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        # under dist training, len(data_batch) = samples_per_gpu * num_gpus
        if not hasattr(self, "ratio_small"):
            self.ratio_small = 0.0
        for i in range(len(self.data_loader)):
            if self.ratio_small < self.stitch_thresh:
                # use stitch input
                data_batch = next(data_loader_unregular_iter)
            else:
                data_batch = next(data_loader_regular_iter)
            self._inner_iter = i
            self.call_hook('before_train_iter')
            self.run_iter(data_batch, train_mode=True, **kwargs)
            self.call_hook('after_train_iter')
            self._iter += 1

        # sleep here
        time.sleep(2)
        self.call_hook('after_train_epoch')
        self._epoch += 1

I can also save checkpoints, but the error is different:

2022-07-19 11:09:27,724 - mmdet - INFO - Saving checkpoint at 1 epochs
[                                                  ] 0/8331, elapsed: 0s, ETA:/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/datasets/pipelines/formating.py:7: UserWarnin
[E ProcessGroupNCCL.cpp:566] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809763 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809779 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809527 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809976 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809989 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809975 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809984 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809779 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809763 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809527 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809976 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
    feat, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
    return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
    outs = self.forward(feats)
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809989 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
    feat, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
    return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
    outs = self.forward(feats)
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
    self.strides)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 129, in forward_single
    bbox_pred = F.relu(bbox_pred)
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1298, in relu
    result = torch.relu(input)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "stitcher/train.py", line 194, in <module>
    main()
  File "stitcher/train.py", line 190, in main
    meta=meta)
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/train_net.py", line 401, in train_detector
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809975 milliseconds before timing out.
/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 40 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 110, in new_func
    return old_func(*args, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 175, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/base.py", line 148, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/detectors/single_stage.py", line 103, in simple_test
    feat, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/base_dense_head.py", line 361, in simple_test
    return self.simple_test_bboxes(feats, img_metas, rescale=rescale)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/models/dense_heads/dense_test_mixins.py", line 36, in simple_test_bboxes
    outs = self.forward(feats)
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 122, in forward
    self.strides)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/utils/misc.py", line 30, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/dw_head.py", line 129, in forward_single
    bbox_pred = F.relu(bbox_pred)
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1298, in relu
    result = torch.relu(input)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "stitcher/train.py", line 194, in <module>
    main()
  File "stitcher/train.py", line 190, in main
    meta=meta)
  File "/gpfs/home/sist/tqzouustc/code/DW/stitcher/train_net.py", line 401, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/stitch_epoch_based_runner.py", line 149, in run
    # self.train or self.val
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/stitch_epoch_based_runner.py", line 74, in train
    time.sleep(2) 
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/sist/tqzouustc/.conda/envs/openmmlab_a100_latest/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/core/evaluation/eval_hooks.py", line 121, in _do_evaluate
    gpu_collect=self.gpu_collect)
  File "/gpfs/home/sist/tqzouustc/code/mmdetection_latest/mmdet/apis/test.py", line 135, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sist/tqzouustc/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 27630) is killed by signal: Interrupt. 
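
From this traceback, the ranks appear to be stuck in collectives while DistEvalHook runs multi_gpu_test after the first epoch, and the NCCL watchdog only fires after 30 minutes. To fail faster with a clearer error while debugging, the process-group timeout can be shortened where distributed training is initialized. This is a minimal sketch, assuming the usual env:// launcher variables are set; the 5-minute value is arbitrary:

import datetime
import os

import torch.distributed as dist

# Ask NCCL to surface errors instead of hanging until the watchdog fires
# (these env vars must be set before the process group is created).
os.environ.setdefault('NCCL_BLOCKING_WAIT', '1')
os.environ.setdefault('NCCL_ASYNC_ERROR_HANDLING', '1')

dist.init_process_group(
    backend='nccl',
    timeout=datetime.timedelta(minutes=5),  # default is 30 minutes
)
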
Icecream-blue-sky commented 2 years ago

anybody?