Open wtliao opened 6 years ago
Even though I try to use a single GPU, I have the same issue.
I'm not entirely sure what's going on here, but it seems like you're using Python 2, which I don't support with this repo. Have you tried using Python 3?
@rowanz When using Python 3.6, the error is:
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorMathPairwise.cu:21
I solved it in a strange way:
try:
    keep.append(keep_im + s)
except BaseException:
    keep.append(keep_im + s)
which means the same operation is run twice, and it works... I have no idea why.
Now I solved this problem by recompiling the nms files using
#!/usr/bin/env bash
# CUDA_PATH=/usr/local/cuda/
cd src/cuda
echo "Compiling stnn kernels by nvcc..."
nvcc -c -o nms.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../../
python build.py
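Rather than hard-coding sm_52, the -arch value should match the GPU's compute capability (per NVIDIA's published tables: Tesla K40 is 3.5, K80 is 3.7, P100 is 6.0, TITAN Xp is 6.1). A minimal helper to build the flag from a capability tuple; the function name `arch_flag` is mine, not from the repo, and with PyTorch installed the capability of device 0 can be read via torch.cuda.get_device_capability(0):

```python
def arch_flag(capability):
    # Map a compute capability (major, minor) to the nvcc -arch flag,
    # e.g. (6, 1) -> "-arch=sm_61".
    major, minor = capability
    return "-arch=sm_%d%d" % (major, minor)

# Examples from NVIDIA's tables: Tesla K80 -> (3, 7), TITAN Xp -> (6, 1)
print(arch_flag((6, 1)))  # -arch=sm_61
```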
But a new problem arises in
feature_pool = RoIAlignFunction(self.pooling_size, self.pooling_size, spatial_scale=1 / 16)(
self.compress(features) if self.use_resnet else features, rois)
with error information
cudaCheckError() failed : no kernel image is available for execution on the device
Process finished with exit code 255
I can't figure out where the problem is. Do you have any idea about it? I fixed it by replacing the roi_align file with this roi_align: https://github.com/jwyang/faster-rcnn.pytorch/tree/master/lib/model/roi_align. Now the code can run through. If you can fix the original roi_align, that would be much better. Thanks again for sharing your impressive work.
My guess is that you have a newer version of CUDA than I did last year. Possibly you'd need to compile it with -arch=sm_61? Sorry for the difficulty anyways, I really wish pytorch had native roipooling (which they're working on for v1).
@rowanz I'm facing the exact same issue. But for me, doing the same steps as @wtliao suggested didn't get rid of the error. I'm using CUDA 8.0 and Tesla K80s, so I even tried compiling with sm_37 in nms, roi_align, and highway_lstm. What would you advise me to do?
@wtliao could you find a fix for the error?
Hi, I solved the issues as I described above. I have tried the code with CUDA 9.0 + K40, CUDA 9.0 + P100, and CUDA 8.0 + TITAN Xp, and they all work now. So I guess you can try updating to CUDA 9.0. I couldn't fix the roi_align issues in the author's code, so I replaced it with mine. BTW, I didn't compile the code using the Makefile provided by the author; I compiled each part of the code one by one using my make.sh under the corresponding directory.
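The per-directory compilation described above can be sketched as a loop. The directory names here are hypothetical placeholders based on the parts named in this thread (nms, roi_align, highway_lstm), not the repo's actual layout:

```shell
#!/usr/bin/env bash
# Run each extension's make.sh in its own directory, skipping any
# directory that doesn't exist; stop on the first build failure.
set -e
for d in nms roi_align highway_lstm; do
    if [ -d "$d" ]; then
        (cd "$d" && bash make.sh)
    fi
done
```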
Thanks a lot @wtliao, I got it running now. And thanks a lot @rowanz for sharing your amazing work. One small doubt: can you please tell me the interpretation of the pred_rel_inds part of the output? I believe it's a [NUM_PRED_RELS, 51] array, with each pair having scores for the 50 types of relationships. The 0th index is "no relationship", right? (Because the total number of relationships is 50.)
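Under that interpretation (a [NUM_PRED_RELS, 51] score array whose column 0 means "no relationship"), the predicted predicate per pair would be the argmax over columns 1..50. A small sketch with made-up scores; this reflects the reading in the question above, not the repo's documented output format:

```python
import random

random.seed(0)
# Hypothetical [NUM_PRED_RELS, 51] score array; column 0 = "no relationship".
rel_scores = [[random.random() for _ in range(51)] for _ in range(5)]
# Predicted predicate per pair: argmax over columns 1..50, keeping 1-based labels.
pred_label = [max(range(1, 51), key=lambda j: row[j]) for row in rel_scores]
assert all(1 <= p <= 50 for p in pred_label)
```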
@wtliao, it always reports that bbox_overlaps can't be found while running, even though I have generated the .so file. Do you have any suggestions? Thank you!
You should run "export PYTHONPATH=where/is/your/project/folder" on the command line before running.
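A quick way to check that the export took effect: every entry of PYTHONPATH is added to sys.path at interpreter startup, so the project folder should appear when you list it. A minimal check (no assumptions beyond the standard library):

```python
import sys

# If the export worked, the project folder will be among these entries,
# which is what lets `bbox_overlaps` and the other compiled modules import.
for p in sys.path:
    print(p)
```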
Hello, thank you very much for your reply. I have already set this, but the same problem still occurs. Is the problem that the .so file cannot be found? In addition, it always reports: RuntimeError: cuda runtime error (2): out of memory. I have changed batch_size to 1, but the same problem still occurs. Do you have any good suggestions?
Hi, I have encountered the following error when I run the code on two TITAN Xps:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518238409320/work/torch/lib/THC/generic/THCTensorMathPairwise.cu line=21 error=8 : invalid device function
Traceback (most recent call last):
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1668, in <module>
    main()
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1662, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/wtliao/.pycharm_helpers/pydev/pydevd.py", line 1072, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 214, in <module>
    rez = train_epoch(epoch)
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 70, in train_epoch
    tr.append(train_batch(batch))
  File "/home/wtliao/work_space/neural-motifs-master/models/train_detector.py", line 103, in train_batch
    result = detector[b]
  File "/home/wtliao/work_space/neural-motifs-master/lib/object_detector.py", line 418, in __getitem__
    outputs = nn.parallel.parallel_apply(replicas, [batch[i] for i in range(self.num_gpus)])
  File "/home/wtliao/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
After debugging line by line, I found that this error arises in the operation keep.append(keep_im + s), line 24 in nms.py.
Any idea to solve it? Thanks!