open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

CUDA Error #7409

Closed s0tt closed 2 years ago

s0tt commented 2 years ago

Describe the bug

Hello, when running Faster R-CNN training on my GPU cluster I get the CUDA error shown in the traceback below. According to #4335, this is apparently caused by a CUDA/torch/mmcv mismatch. I double-checked everything and compared my versions:

Torch was installed with the official command for CUDA 11.3, so torch and CUDA should match:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

So I currently do not see where this goes wrong. Thanks for your help!
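The version comparison described above can also be scripted. The following is a minimal sketch, assuming a compiled (full) mmcv 1.x build so that `mmcv.ops.get_compiling_cuda_version` is importable; the helper names `cuda_major_minor` and `versions_match` are made up for illustration and are not part of either library.

```python
def cuda_major_minor(version):
    """Return (major, minor) from a string such as '11.3' or '11.3.58'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def versions_match(torch_cuda, mmcv_cuda):
    """True if both version strings name the same CUDA major.minor release."""
    return cuda_major_minor(torch_cuda) == cuda_major_minor(mmcv_cuda)


if __name__ == "__main__":
    try:
        import torch
        from mmcv.ops import get_compiling_cuda_version
    except ImportError:
        print("torch or a compiled mmcv build is not importable; nothing to compare")
    else:
        torch_cuda = torch.version.cuda  # None for CPU-only PyTorch builds
        mmcv_cuda = get_compiling_cuda_version()
        print("PyTorch CUDA:", torch_cuda, "| MMCV CUDA compiler:", mmcv_cuda)
        try:
            print("match:", versions_match(torch_cuda or "", mmcv_cuda))
        except (ValueError, IndexError):
            print("could not parse one of the version strings")
```

A match here only says that both packages were built against the same CUDA toolkit release. It does not show which GPU architectures the kernels were compiled for.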

Reproduction

  1. What command or script did you run? ./train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_lvis.py --gpus 1

  2. What dataset did you use? LVIS v0.5

Environment

Currently Loaded Modulefiles:
 1) conda/4.11.0(default)   2) cuda/11.3   3) cudnn/11.3_v8.2

Key: (symbolic-version) =auto-loaded
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-SXM2-32GB
CUDA_HOME: /apps/cuda/11.3
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.6
MMCV Compiler: GCC 8.5
MMCV CUDA Compiler: 11.3
MMDetection: 2.22.0+b612b5c

Error traceback

Traceback (most recent call last):
  File "./tools/train.py", line 200, in <module>
    main()
  File "./tools/train.py", line 190, in main
    train_detector(
  File "/user/git/mmdetection/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/user/git/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/user/.conda/mmdet2_22_0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 339, in forward_train
    proposal_list = self.get_bboxes(
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 102, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, bbox_pred_list,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 185, in _get_bboxes_single
    return self._bbox_post_process(mlvl_scores, mlvl_bbox_preds,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 231, in _bbox_post_process
    dets, _ = batched_nms(proposals, scores, ids, cfg.nms)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 326, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/user/git/mmcv/mmcv/utils/misc.py", line 340, in new_func
    output = old_func(*args, **kwargs)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 172, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset,
  File "/user/git/mmcv/mmcv/ops/nms.py", line 26, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
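"No kernel image is available" generally means that the compiled extension contains no code for the GPU's compute capability. The sketch below shows one way to check a device capability against a list of compiled architectures. It rests on assumptions: `parse_arch` and `arch_supported` are hypothetical helpers, and the rule encoded is the general CUDA one (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version; `compute_XY` PTX can be JIT-compiled for any equal or newer device). Note that `torch.cuda.get_arch_list()` reports the architectures PyTorch itself was built for, not those of a source-built mmcv, so it only rules out the PyTorch side.

```python
def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_86' into (kind, major, minor)."""
    kind, _, rest = tag.partition("_")
    digits = "".join(ch for ch in rest if ch.isdigit())  # drops suffixes like 'a'
    return kind, int(digits[:-1]), int(digits[-1])


def arch_supported(capability, arch_list):
    """True if any compiled arch in arch_list can run on the given device."""
    major, minor = capability
    for tag in arch_list:
        kind, a_major, a_minor = parse_arch(tag)
        # Binary code: same major version, compiled minor not above the device's.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any device at least as new.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("PyTorch arch list:", archs)
        print("covered by PyTorch build:", arch_supported(cap, archs))
    else:
        print("No CUDA-enabled PyTorch found; skipping device check")
```

For an extension built from source through PyTorch's build helpers, the target architectures are taken from the `TORCH_CUDA_ARCH_LIST` environment variable if it is set, otherwise from the GPUs visible at build time. Whether that applies to the mmcv build in this report is an assumption, not something the traceback confirms.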

hhaAndroid commented 2 years ago

@s0tt We have not yet verified PyTorch 1.11.0. @zhouzaida, could you please check this?