RuntimeError: cuda runtime error (8) in nms

wtliao commented 6 years ago

Hi, I encountered the following error when running the code on two Titan XP GPUs:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518238409320/work/torch/lib/THC/generic/THCTensorMathPairwise.cu line=21 error=8 : invalid device function
Traceback (most recent call last):
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1668, in <module>
    main()
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1662, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1072, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 214, in <module>
    rez = train_epoch(epoch)
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 70, in train_epoch
    tr.append(train_batch(batch))
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 103, in train_batch
    result = detector[b]
  File "/home/wtliao/work_space/neural-motifs-master/lib/object_detector.py", line 418, in __getitem__
    outputs = nn.parallel.parallel_apply(replicas, [batch[i] for i in range(self.num_gpus)])
  File "/home/wtliao/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output

After debugging line by line, I found that the error arises in the operation keep.append(keep_im + s), at line 24 of nms.py.

Any idea how to solve it? Thanks!

wtliao commented 6 years ago

Even when I use a single GPU, I get the same issue.

rowanz commented 6 years ago

I'm not entirely sure what's going on here, but it seems like you're using Python 2, which I don't support with this repo. Have you tried using Python 3?

wtliao commented 6 years ago

@rowanz With Python 3.6, the error is: RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorMathPairwise.cu:21

I solved it in a strange way:

    try:
        keep.append(keep_im + s)
    except BaseException:
        # Retry once: the first attempt raises, the second one succeeds.
        keep.append(keep_im + s)

In other words, running the operation twice makes it work. I have no idea why.
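
One possible reading of why the retry helps, not confirmed anywhere in this thread: CUDA kernel launches are asynchronous, so a failed launch can be reported by a later, unrelated call (here a tensor addition), and the retry then runs after that error has already been consumed. If that is what happens, the retry only hides the failure. A minimal sketch for locating the real failure is to rerun with CUDA_LAUNCH_BLOCKING=1, which makes launches synchronous. The helper names blocking_env and run_blocking are hypothetical, and the script path is the one from the traceback above.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # The CUDA runtime reads this variable when it initialises, so it has to be
    # set before the training process starts.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run `cmd` in a child process with CUDA_LAUNCH_BLOCKING=1 (hypothetical helper)."""
    return subprocess.run(cmd, env=blocking_env(), check=True)


if __name__ == "__main__":
    # Script path as it appears in the traceback; adjust to your checkout.
    cmd = [sys.executable, "models/train_detector.py"]
    print("would run:", " ".join(cmd), "with CUDA_LAUNCH_BLOCKING=1")
    # Call run_blocking(cmd) here to actually start the run.
```

With synchronous launches the traceback should point at the call that actually failed, at the cost of slower execution.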

wtliao commented 6 years ago

I have now solved this problem by recompiling the nms files using

#!/usr/bin/env bash
# CUDA_PATH=/usr/local/cuda/
cd src/cuda
echo "Compiling stnn kernels by nvcc..."
nvcc -c -o nms.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../../
python build.py

But a new problem arises in

feature_pool = RoIAlignFunction(self.pooling_size, self.pooling_size, spatial_scale=1 / 16)(
            self.compress(features) if self.use_resnet else features, rois)

with error information

cudaCheckError() failed : no kernel image is available for execution on the device

Process finished with exit code 255

I can't figure out where the problem is. Do you have any idea about it? For now I fixed it by replacing the roi_align files with this roi_align (https://github.com/jwyang/faster-rcnn.pytorch/tree/master/lib/model/roi_align). Now the code runs through. If you can fix the original roi_align, that would be much better. Thanks again for sharing your impressive work.

rowanz commented 6 years ago

My guess is that you have a newer version of CUDA than I did last year. Possibly you'd need to compile it with -arch=sm_61? Sorry for the difficulty anyway. I really wish PyTorch had native RoI pooling (which they're working on for v1).
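
The value after -arch= has to correspond to the GPU's compute capability, and the installed toolkit has to know that architecture. A minimal sketch of the mapping for the GPUs mentioned in this thread: the table and the helper names arch_flag and detect_capability are illustrative assumptions, not part of the repository. The capabilities listed are NVIDIA's published values.

```python
# Compute capabilities (major, minor) for the GPUs named in this thread.
COMPUTE_CAPABILITY = {
    "TITAN Xp": (6, 1),
    "Tesla P100": (6, 0),
    "Tesla K80": (3, 7),
    "Tesla K40": (3, 5),
}


def arch_flag(major, minor):
    """Format an nvcc -arch flag from a compute capability pair."""
    return "-arch=sm_{}{}".format(major, minor)


def detect_capability():
    """Return (major, minor) of GPU 0 via PyTorch, or None if unavailable."""
    try:
        import torch  # optional: only needed for auto-detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    for name, cc in sorted(COMPUTE_CAPABILITY.items()):
        print("{:<12} {}".format(name, arch_flag(*cc)))
    detected = detect_capability()
    if detected is not None:
        print("this machine:", arch_flag(*detected))
```

Code built only for a newer architecture cannot run on an older GPU, so a K80 (3.7) needs its own target.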

ritwickchaudhry commented 5 years ago

@rowanz I'm facing the exact same issue, but following the steps @wtliao suggested didn't get rid of the error. I'm using CUDA 8.0 and Tesla K80s, so I even tried compiling with sm_37 in nms, roi_align and highway_lstms. What would you advise me to do?

@wtliao were you able to find a fix for the error?

wtliao commented 5 years ago

Hi, I solved the issues as I described above. I have tried the code on CUDA 9.0 + K40, CUDA 9.0 + P100, and CUDA 8.0 + TITAN XP, and they all work now. So my guess is that you could try updating to CUDA 9.0. I couldn't fix the roi_align issue in the author's code, so I replaced it with mine. BTW, I didn't compile the code using the Makefile provided by the author. I compiled each part of the code one by one using my own make.sh under the corresponding directory.
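
For a pool of machines like the one listed here (K40, P100, Titan Xp), nvcc can embed code for several architectures in one object file through repeated -gencode options. A sketch of assembling such a command for the nms kernel: the file names are copied from the build script earlier in the thread, while the architecture list and the helper name nvcc_command are assumptions. The command is only printed, not run.

```python
import shlex


def nvcc_command(source, output, archs):
    """Build an nvcc invocation that targets every architecture in `archs`."""
    cmd = ["nvcc", "-c", "-o", output, source, "-x", "cu", "-Xcompiler", "-fPIC"]
    for arch in archs:
        # One -gencode pair per target: virtual arch compute_XX, real arch sm_XX.
        cmd += ["-gencode", "arch=compute_{0},code=sm_{0}".format(arch)]
    return cmd


if __name__ == "__main__":
    # 35 = K40, 60 = P100, 61 = Titan Xp (assumed targets; trim to your GPUs).
    cmd = nvcc_command("nms_kernel.cu", "nms.cu.o", ["35", "60", "61"])
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command would be run from src/cuda, followed by python build.py as in the original script. Every architecture in the list has to be supported by the installed CUDA toolkit.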

ritwickchaudhry commented 5 years ago

Thanks a lot @wtliao, I got it running now. And thanks a lot @rowanz for sharing your amazing work. One small question: can you please tell me how to interpret the pred_rel_inds part of the output? I believe it's a [NUM_PRED_RELS, 51] array holding, for each pair, scores for the 50 types of relationships. Index 0 is "no relationship", right? (Because the total number of relationships is 50.)
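
Under the layout assumed in this question (one row of 51 scores per object pair, with column 0 meaning "no relationship"), which the thread does not confirm, the most likely real predicate for a pair can be read by skipping column 0. A sketch in plain Python with made-up scores. best_predicate is a hypothetical helper, not a function from the repository.

```python
def best_predicate(scores, background_index=0):
    """Return (index, score) of the highest-scoring non-background class."""
    candidates = [(s, i) for i, s in enumerate(scores) if i != background_index]
    score, index = max(candidates)
    return index, score


if __name__ == "__main__":
    # Toy rows with 4 classes instead of 51; column 0 is the assumed background.
    rel_scores = [
        [0.70, 0.10, 0.15, 0.05],
        [0.20, 0.05, 0.05, 0.70],
    ]
    for row in rel_scores:
        print(best_predicate(row))
```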

L6-hong commented 4 years ago

@wtliao, I keep getting an error that bbox_overlaps can't be found when running, even though I have generated the .so file. Do you have any suggestions? Thank you!

wtliao commented 4 years ago

You should run the command "export PYTHONPATH=where/is/your/project/folder" before running.
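
To check whether the interpreter can actually see a compiled extension after PYTHONPATH is set, the import machinery can be queried directly. A minimal sketch: locate is a hypothetical helper, and the default module name is a placeholder, since the exact import path of bbox_overlaps in the repository is not given in this thread.

```python
import importlib.util
import sys


def locate(module_name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name cannot be imported.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Placeholder name; replace it with the import that fails in your run.
    name = sys.argv[1] if len(sys.argv) > 1 else "bbox_overlaps"
    print("sys.path entries:", len(sys.path))
    print(name, "->", locate(name) or "not importable from current sys.path")
```

Run it from the same shell and environment used for training, so that it sees the same PYTHONPATH.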

L6-hong commented 4 years ago

Hello, thank you very much for your reply. I have already set this, but the same problem still occurs. Is the problem that the .so file cannot be found? In addition, I always get: RuntimeError: cuda runtime error (2) : out of memory. I have changed batch_size to 1, but the same problem still occurs. Do you have any good suggestions?
