Describe the bug
Hello,
when running Faster R-CNN training on my GPU cluster, I get the error shown in the traceback below.
According to this thread (#4335), this is apparently caused by a CUDA/torch/mmcv mismatch.
I double-checked everything and compared my versions.
Error traceback
Traceback (most recent call last):
  File "./tools/train.py", line 200, in <module>
    main()
  File "./tools/train.py", line 190, in main
    train_detector(
  File "/user/git/mmdetection/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/user/git/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/user/.conda/mmdet2_22_0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 339, in forward_train
    proposal_list = self.get_bboxes(
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 102, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, bbox_pred_list,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 185, in _get_bboxes_single
    return self._bbox_post_process(mlvl_scores, mlvl_bbox_preds,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 231, in _bbox_post_process
    dets, _ = batched_nms(proposals, scores, ids, cfg.nms)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 326, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/user/git/mmcv/mmcv/utils/misc.py", line 340, in new_func
    output = old_func(*args, **kwargs)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 172, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset,
  File "/user/git/mmcv/mmcv/ops/nms.py", line 26, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Torch was installed with the official command for CUDA 11.3, so torch and CUDA should match.
So I currently do not see what is going wrong. Thanks for your help!
Reproduction
What command or script did you run? ./train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_lvis.py --gpus 1
What dataset did you use? LVIS v0.5
Environment
Currently Loaded Modulefiles:
1) conda/4.11.0(default) 2) cuda/11.3 3) cudnn/11.3_v8.2
Key: (symbolic-version) =auto-loaded
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0
sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-SXM2-32GB
CUDA_HOME: /apps/cuda/11.3
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.6
MMCV Compiler: GCC 8.5
MMCV CUDA Compiler: 11.3
MMDetection: 2.22.0+b612b5c
Bug fix
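No fix has been confirmed yet. One thing that may be worth checking: "no kernel image is available" generally means the compiled CUDA code contains no kernels for the compute capability of the GPU it runs on, even when all version numbers agree. The Tesla V100 has compute capability 7.0 (sm_70). The sketch below is only an illustration of that compatibility rule, not part of mmcv or PyTorch; the helper names parse_arch and gpu_is_covered are made up for this example. Note that torch.cuda.get_arch_list() reports the architectures PyTorch itself was built for, not those of the locally compiled mmcv extension in /user/git/mmcv, which would need to be inspected separately (for example with the cuobjdump tool from the CUDA toolkit).

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_86' into (kind, major, minor).

    Returns None for tags that do not follow this naming pattern.
    """
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", tag)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


def gpu_is_covered(capability, arch_list):
    """Return True if a GPU with the given (major, minor) capability can run
    code built for at least one entry in arch_list.

    Hypothetical helper. Rules used:
    - 'sm_XY' binaries run on GPUs with the same major version and minor >= Y.
    - 'compute_XY' PTX can be JIT-compiled for any capability >= X.Y.
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, a_major, a_minor = parsed
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("GPU capability:", cap)
        print("PyTorch arch list:", archs)
        print("covered by PyTorch build:", gpu_is_covered(cap, archs))
```

If the mmcv extension turns out to lack sm_70 (this can happen when it is compiled on a node with no GPU or a different GPU), rebuilding mmcv with the TORCH_CUDA_ARCH_LIST environment variable set to include 7.0 might help; this is an assumption about the cause, not a confirmed diagnosis.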