open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

RuntimeError: CUDA error: a PTX JIT compilation failed #766

Closed: farleylai closed this issue 2 years ago

farleylai commented 5 years ago

The error is thrown when mmdet is compiled on a user machine with a Titan V (Pascal) and executed on a cluster worker machine with a newer GTX 1080 Ti.

Both machines have CUDA 10 installed. However, if mmdet is compiled on a machine with an RTX Titan, execution on both the Titan V and the GTX 1080 Ti machines is fine. After some cross-testing, this seems to be the only failing combination.

Any idea how to address this failure case?

Update: it should be Titan X, not V.

Here is the minimal code to reproduce with GPU nms():

import torch
from mmdet.ops import nms

# Random boxes in (x1, y1, x2, y2, score) format on the GPU; the values
# do not matter, only that the CUDA kernel gets launched.
dets = torch.rand(2, 5).cuda()
nms(dets, 0.1)

The error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 43, in nms
    inds = nms_cuda.nms(dets_th, iou_thr)
RuntimeError: CUDA error: a PTX JIT compilation failed (launch_kernel at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/ATen/native/cuda/Loops.cuh:72)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f8cb2caedc5 in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> const&) + 0x33e (0x7f8cb88ec13e in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: <unknown function> + 0x27e9cca (0x7f8cb88e7cca in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: <unknown function> + 0x27ea555 (0x7f8cb88e8555 in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: <unknown function> + 0x6cb2aa (0x7f8cb35932aa in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::native::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x3e8 (0x7f8cb3590ed8 in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: at::TypeDefault::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) const + 0x6c (0x7f8cb395ab0c in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) const + 0x6f9 (0x7f8cab742289 in /home/ml/farleylai/Backups/miniconda3/envs/sinet36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #8: at::Tensor::index(c10::ArrayRef<at::Tensor>) const + 0x59 (0x7f8c8f5265af in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #9: nms_cuda(at::Tensor, float) + 0x77b (0x7f8c8f524a59 in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #10: nms(at::Tensor const&, float) + 0x221 (0x7f8c8f517f31 in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x22195 (0x7f8c8f523195 in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x2228e (0x7f8c8f52328e in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x1f79c (0x7f8c8f52079c in /zdata/users/farleylai/projects/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #29: __libc_start_main + 0xf0 (0x7f8ce75fd830 in /lib/x86_64-linux-gnu/libc.so.6)

PyTorch 1.1, installed via Miniconda 3.
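For diagnosing this kind of architecture mismatch, here is a quick sketch (using only PyTorch's built-in device queries, nothing mmdet-specific) to print what the runtime GPUs actually report:

import torch

# Print the compute capability of every visible GPU. A PTX JIT failure
# usually means the compiled extension carries neither SASS nor PTX that
# this capability can use.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print('GPU %d: %s (sm_%d%d)' % (i, torch.cuda.get_device_name(i), major, minor))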

farleylai commented 5 years ago

It seems that, for some reason, the default nvcc compilation does not generate code for the GTX 1080 Ti when building on a Titan X machine. In that case, one can explicitly add architecture and gencode options in setup.py. Here is an example for mmdet/ops/nms/setup.py:

# Baseline target architecture (Maxwell).
nvcc_ARCH  = ['-arch=sm_52']
# Embed PTX for compute_75 so that future architectures can JIT-compile it.
nvcc_ARCH += ["-gencode=arch=compute_75,code=\"compute_75\""]
# Embed native SASS for Turing (sm_75), Volta (sm_70), Pascal (sm_61),
# and Maxwell (sm_52) so no JIT step is needed on those GPUs.
nvcc_ARCH += ["-gencode=arch=compute_75,code=\"sm_75\""]
nvcc_ARCH += ["-gencode=arch=compute_70,code=\"sm_70\""]
nvcc_ARCH += ["-gencode=arch=compute_61,code=\"sm_61\""]
nvcc_ARCH += ["-gencode=arch=compute_52,code=\"sm_52\""]
extra_compile_args = {
    'cxx': ['-Wno-unused-function', '-Wno-write-strings'],
    'nvcc': nvcc_ARCH,
}

Then pass the extra_compile_args composed above to the extension:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='nms_cuda',
    ext_modules=[
        CUDAExtension('nms_cuda', [
            'src/nms_cuda.cpp',
            'src/nms_kernel.cu',
        ],
        extra_compile_args=extra_compile_args,
        ),
        CUDAExtension('nms_cpu', [
            'src/nms_cpu.cpp',
        ]),
    ],
    cmdclass={'build_ext': BuildExtension})

After rebuilding this particular nms module on the TITAN X machine and reinstalling mmdet, it now works on the GTX 1080 Ti too. Nonetheless, the real cause is likely something deeper. Since there is a setup.py per CUDA module, it would be tedious to change them all. Any better suggestions or clarifications?
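One possible way to avoid touching every per-op setup.py is an untested sketch that assumes a PyTorch version whose torch.utils.cpp_extension honors the TORCH_CUDA_ARCH_LIST environment variable: set the architecture list once before the build scripts run, e.g. at the top of each setup.py or in the shell that invokes them:

import os

# Assumption: torch.utils.cpp_extension derives its -gencode flags from
# TORCH_CUDA_ARCH_LIST when it is set; '+PTX' additionally embeds PTX so
# newer GPUs can JIT-compile the kernels.
os.environ['TORCH_CUDA_ARCH_LIST'] = '5.2 6.1 7.0 7.5+PTX'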

Update: it should be Titan X, not V.

hellock commented 5 years ago

The failure case is quite weird. It seems to be neither a forward nor a backward compatibility issue between architectures. (Isn't the TITAN V the Volta arch?) Manually specifying the target arch can be a workaround, and maybe we can wait for someone who figures this out.

farleylai commented 5 years ago

I compared the output of cuobjdump -ptx on the nms_cuda*.so shared libraries built on the different machines and found that only the one produced on the Titan X differs from those built on the RTX Titan and the 1080 Ti.

The objdumped results are attached FYI:

Though the arch and target are both the default sm_30, the PTX version produced on the Titan X is higher (6.4), with very different align offsets.
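As a further check (assuming the stock CUDA binary utilities), the SASS and PTX entries actually embedded in each shared library can be listed directly:

cuobjdump -lelf -lptx nms_cuda.cpython-36m-x86_64-linux-gnu.so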

PS: the nvcc installation (via conda) is identical across the machines, shared over NFS:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Any insights?

Update: it should be Titan X, not V.

Simon4Yan commented 5 years ago

I am also hitting this issue, and I need some help.

inds = nms_cuda.nms(dets_th, iou_thr)
RuntimeError: CUDA error: a PTX JIT compilation failed (launch_kernel at /opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/Loops.cuh:102)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f889750ae37 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> const&) + 0x79f (0x7f889e583b0f in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #2: <unknown function> + 0x54ce442 (0x7f889e57d442 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x54ce7a8 (0x7f889e57d7a8 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x16293bb (0x7f889a6d83bb in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #5: at::native::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x417 (0x7f889a6d5887 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #6: at::TypeDefault::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x74 (0x7f889abc3bc4 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::VariableType::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x836 (0x7f889c4341d6 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #8: at::Tensor::index(c10::ArrayRef<at::Tensor>) const + 0x70 (0x7f889c260840 in /home/dengweijian/.conda/envs/mmlab/lib/python3.5/site-packages/torch/lib/libtorch.so)
frame #9: nms_cuda(at::Tensor, float) + 0x79a (0x7f887a306904 in /home/dengweijian/Documents/mmdet/mmdet/ops/nms/nms_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #10: nms(at::Tensor const&, float) + 0x189 (0x7f887a2f6689 in /home/dengweijian/Documents/mmdet/mmdet/ops/nms/nms_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x27766 (0x7f887a304766 in /home/dengweijian/Documents/mmdet/mmdet/ops/nms/nms_cuda.cpython-35m-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x241c0 (0x7f887a3011c0 in /home/dengweijian/Documents/mmdet/mmdet/ops/nms/nms_cuda.cpython-35m-x86_64-linux-gnu.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f88d63d8b97 in /lib/x86_64-linux-gnu/libc.so.6)

MikhailMashukov commented 4 years ago

It seems that, for some reason, the default nvcc compilation does not generate code for the GTX 1080 Ti when building on a Titan X machine. In that case, one can explicitly add architecture and gencode options in setup.py. Here is an example for mmdet/ops/nms/setup.py:

nvcc_ARCH  = ['-arch=sm_52']
nvcc_ARCH += ["-gencode=arch=compute_75,code=\"compute_75\""]
nvcc_ARCH += ["-gencode=arch=compute_75,code=\"sm_75\""]
...

I am also struggling with this problem. Could you please clarify how to apply it? Should these flags go into a fresh mmdet/ops/nms/setup.py? A fresh file requires imports and maybe something else. Should it then be run separately, from the mmdetection/mmdet/ops/nms/ subfolder?

I was able to run the build by copying mmdetection/setup.py into mmdetection/mmdet/ops/nms/ and replacing the code after if __name__ == '__main__':, but the error still remains, and I don't know how to check whether this brought the fix any closer.

P.S. Unexpectedly, setting up a new, otherwise identical conda environment that used a newer nvcc fixed the problem (9.1 is installed in /usr/bin and 10.1 in /usr/local/cuda-10.1; I switched to the latter by prepending it to PATH). So it looks like a build/rebuild problem, since everything else is essentially identical.
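A small sketch for comparing the toolkits involved (torch.version.cuda is the toolkit the PyTorch binary was built against, while nvcc --version shows the compiler the extension build will pick up from PATH):

import subprocess
import torch

# The two versions should be compatible; a stale nvcc on PATH silently
# builds extensions against the wrong toolkit.
print('torch built with CUDA:', torch.version.cuda)
print(subprocess.check_output(['nvcc', '--version']).decode())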

juanluisrosaramos commented 4 years ago

I had this same problem on a GCloud Tesla P100-PCIE-16GB with PyTorch 1.3.1 / torchvision 0.4.2 / cudatoolkit 10.1.243 / CUDA driver version 10020 (i.e. 10.2).

I couldn't solve it, so I moved to Docker and it works, with:

ARG PYTORCH="1.1.0"
ARG CUDA="10.0"
ARG CUDNN="7.5"

So it is not a GPU problem but something to do with the software versions. Thanks.
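For reference, a hypothetical build invocation with those pinned versions (assuming a Dockerfile that accepts these build args, presumably the one shipped in mmdetection's docker/ directory):

docker build --build-arg PYTORCH=1.1.0 --build-arg CUDA=10.0 --build-arg CUDNN=7.5 -t mmdetection docker/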