CUDA Error - Githubissues

Describe the bug

Hello, when running Faster R-CNN training on my GPU cluster I get the CUDA error shown in the traceback below. According to this thread (#4335), it is apparently caused by a CUDA/torch/mmcv version mismatch. I double-checked everything and compared my versions:

torch 1.11.0
CUDA 11.3, cudatoolkit 11.3.1
mmdet 2.22.0, mmcv 1.4.6

These should apparently be compatible according to: https://mmdetection.readthedocs.io/en/latest/get_started.html

Torch was installed with the official command for CUDA 11.3, so torch and CUDA should match:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

So I currently do not see what is going wrong. Thanks for your help!

Reproduction

What command or script did you run? ./train.py configs/faster_rcnn/faster_rcnn_r50_fpn_1x_lvis.py --gpus 1
What dataset did you use? LVIS v0.5

Environment

Currently Loaded Modulefiles: 1) conda/4.11.0(default) 2) cuda/11.3 3) cudnn/11.3_v8.2

Key: (symbolic-version) =auto-loaded

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla V100-SXM2-32GB
CUDA_HOME: /apps/cuda/11.3
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.3
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
CuDNN 8.2
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.12.0
OpenCV: 4.5.5
MMCV: 1.4.6
MMCV Compiler: GCC 8.5
MMCV CUDA Compiler: 11.3
MMDetection: 2.22.0+b612b5c
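
One thing the environment dump can already answer is whether PyTorch itself was built for this GPU. The sketch below parses the "NVCC architecture flags" line from the report and checks it against the Tesla V100, which has compute capability 7.0. The helper names (`archs_from_nvcc_flags`, `covers`) are made up for illustration, and the exact-match check is a simplification: it ignores that a binary for a lower minor version of the same major architecture can also run, and that embedded PTX can be JIT-compiled.

```python
import re

def archs_from_nvcc_flags(flags):
    """Return the set of sm_XX binary targets named in a -gencode flag string."""
    # Only "code=sm_XX" entries are real device binaries; "code=compute_XX" is PTX.
    return set(re.findall(r"code=(sm_\d+)", flags))

def covers(archs, capability):
    """Exact-match check: is there a binary for this (major, minor) capability?"""
    major, minor = capability
    return f"sm_{major}{minor}" in archs

# Copied from the "NVCC architecture flags" line of the environment report.
TORCH_FLAGS = (
    "-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;"
    "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;"
    "-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;"
    "-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;"
    "-gencode;arch=compute_37,code=compute_37"
)
V100_CAPABILITY = (7, 0)  # Tesla V100 is compute capability 7.0

torch_archs = archs_from_nvcc_flags(TORCH_FLAGS)
print(sorted(torch_archs))
print("PyTorch build covers V100:", covers(torch_archs, V100_CAPABILITY))
```

Since the PyTorch build lists `sm_70`, the missing kernel is more likely in a separately compiled extension than in PyTorch itself. The report does not show which architectures the MMCV ops were compiled for, so that part remains an assumption to verify.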

Error traceback

Traceback (most recent call last):
  File "./tools/train.py", line 200, in <module>
    main()
  File "./tools/train.py", line 190, in main
    train_detector(
  File "/user/git/mmdetection/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/user/git/mmcv/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/user/git/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/user/.conda/mmdet2_22_0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/user/git/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 339, in forward_train
    proposal_list = self.get_bboxes(
  File "/user/git/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/user/git/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 102, in get_bboxes
    results = self._get_bboxes_single(cls_score_list, bbox_pred_list,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 185, in _get_bboxes_single
    return self._bbox_post_process(mlvl_scores, mlvl_bbox_preds,
  File "/user/git/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 231, in _bbox_post_process
    dets, _ = batched_nms(proposals, scores, ids, cfg.nms)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 326, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/user/git/mmcv/mmcv/utils/misc.py", line 340, in new_func
    output = old_func(*args, **kwargs)
  File "/user/git/mmcv/mmcv/ops/nms.py", line 172, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset,
  File "/user/git/mmcv/mmcv/ops/nms.py", line 26, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
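
"No kernel image is available" generally means that the compiled code being called contains no binary for the architecture of the GPU it runs on. The traceback ends in `ext_module.nms` under `/user/git/mmcv`, which suggests (as an assumption, not something the report confirms) that MMCV was built from source and its ops may have been compiled without the V100's architecture. A minimal sketch for collecting the relevant facts on the compute node follows; `describe_cuda_setup` is a hypothetical helper name, and it only uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`.

```python
def describe_cuda_setup():
    """Collect the GPU architecture and the architectures PyTorch was built for."""
    try:
        import torch
    except ImportError:
        # Lets the sketch run (and report something) on machines without PyTorch.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch": torch.__version__,
        "torch_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        info["device_arch"] = f"sm_{major}{minor}"
        # Architectures compiled into the PyTorch binary itself (not into MMCV).
        info["torch_arch_list"] = torch.cuda.get_arch_list()
    return info

print(describe_cuda_setup())
```

Run this on the same node that runs training, since a login node may have a different GPU or none at all. If the device architecture turns out to be missing only from the MMCV build, one thing to try is rebuilding MMCV with the `TORCH_CUDA_ARCH_LIST` environment variable set to include the device (for a V100, `7.0`), because PyTorch's extension builder reads that variable when choosing compile targets.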

open-mmlab / mmdetection

CUDA Error #7409

Bug fix