longcw / yolo2-pytorch

YOLOv2 in PyTorch

cuda runtime error : invalid device function #12

Closed rdfong closed 7 years ago

rdfong commented 7 years ago

Hello, I'm currently working on a project that uses YOLO v2 as a base, and I'm very interested in using your PyTorch implementation as a starting point. However, I've run into a strange issue right from the start when running the test and demo scripts.

The error is: RuntimeError: cuda runtime error (8) : invalid device function at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCTensorCopy.cu:204

Oddly, when running the demo script, the first image (the one with the computer) appears with detection boxes, and the error only happens after pressing a key to move to the next image.

I've already changed the architecture in make.sh to sm_30 which is what my video card is compatible with. Have you run into this kind of issue before? Perhaps there is another architecture setting I'm missing somewhere or maybe it has to do with my install of pytorch itself...

Let me know if you have any ideas. Once I get this running I hope to port over your mAP scoring code to pjreddie's implementation and compare scores.
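For anyone checking the same setting: the architecture value mentioned above is the GPU's compute capability written as `sm_XY`. Below is a minimal Python sketch of that mapping. `arch_flag` and `query_arch_flag` are hypothetical helper names, not part of yolo2-pytorch, and the query assumes a recent PyTorch in which `torch.cuda.get_device_capability` is available (it may not exist in the version used in this thread).

```python
def arch_flag(major, minor):
    """Format a compute capability as an sm_XY string, e.g. (3, 0) -> 'sm_30'."""
    return "sm_{}{}".format(major, minor)


def query_arch_flag(device_index=0):
    """Return the sm_XY string for a GPU, or None if PyTorch or CUDA is unavailable.

    Assumption: a PyTorch version that provides torch.cuda.get_device_capability.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(device_index)
    return arch_flag(major, minor)
```

Calling `query_arch_flag()` on the machine in question should give the value to compare against whatever is set in make.sh.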

rdfong commented 7 years ago

Any thoughts here? Any help is much appreciated!

longcw commented 7 years ago

Sorry for my late reply. I have no idea what is causing your problem, but maybe you can provide me with some debug information, such as the line on which the error happened.

rdfong commented 7 years ago

Hello @longcw !

My current setup is cuda 8.0 with cudnn 5.1 and nvidia drivers 375.39 on a GTX870M.

The error occurs on line 49 of demo.py:

frame: 0, (detection: 2.0 Hz, 498.6 ms) (total: 1.9 Hz, 532.8 ms)
THCudaCheck FAIL file=/data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCTensorCopy.cu line=204 error=8 : invalid device function
Traceback (most recent call last):
  File "./demo.py", line 49, in <module>
    bbox_pred, iou_pred, prob_pred = net(im_data)

and again on line 55 when running test.py:

  File "./test.py", line 118, in <module>
    test_net(net, imdb, max_per_image, thresh, vis)
  File "./test.py", line 55, in test_net
    bbox_pred, iou_pred, prob_pred = net(im_data)

Admittedly, this is my first time using any sort of machine learning library whatsoever, so it may be something as simple as me installing something incorrectly. Thanks a bunch for your help. In the meantime I'll keep scouring the web for hints as to what the issue might be.


The remainder of the stack trace is common to both failures and goes into the torch libraries through:

  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/roger/HydroContestDetection/ML/yolo2-pytorch/darknet.py", line 171, in forward
    conv1s = self.conv1s(im_data)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/roger/HydroContestDetection/ML/yolo2-pytorch/utils/network.py", line 31, in forward
    x = self.conv(x)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 38, in conv2d
    return f(input, weight, bias) if bias is not None else f(input, weight)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/_functions/conv.py", line 32, in forward
    input = input.contiguous()
RuntimeError: cuda runtime error (8) : invalid device function at /data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/THCTensorCopy.cu:204
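Since the trace ends in PyTorch's own `input.contiguous()` call rather than in this repository's compiled extensions, one way to narrow things down might be a standalone check that exercises a device-side copy without yolo2-pytorch involved at all. The sketch below is an assumption-laden illustration, not something from the repository: `cuda_smoke_test` is a hypothetical name, and the tensor calls assume a recent PyTorch API rather than the version used in this thread.

```python
def cuda_smoke_test():
    """Run a tiny GPU copy and report the outcome as a short status string."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "cuda not available"
    try:
        # Build a small 3x4 tensor on the GPU holding the values 0..11.
        x = torch.arange(12, dtype=torch.float32).reshape(3, 4).cuda()
        # Transposing makes the tensor non-contiguous, so contiguous()
        # has to perform a real copy on the device.
        y = x.t().contiguous()
        total = float(y.sum())
    except RuntimeError as exc:
        return "cuda error: {}".format(exc)
    # 0 + 1 + ... + 11 = 66
    return "ok" if total == 66.0 else "unexpected result"
```

If this already fails with the same error, the problem would more likely sit in the PyTorch install than in the architecture chosen in make.sh.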

rdfong commented 7 years ago

Perhaps the issue is that I'm using sm_30 and not sm_35, which is where "Dynamic parallelism support" is added according to the CUDA docs. This is only a hunch, but perhaps THCTensorCopy.cu requires sm_35 or greater. That said, I would have expected the CUDA framework to detect somewhere that I'm using sm_30 and adjust for it.
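To make the hunch concrete: CUDA error 8 ("invalid device function") is typically reported when a binary contains no kernel image usable on the device's architecture. The Python sketch below shows the usual compatibility rule. Compiled `sm_XY` code runs on devices with the same major version and an equal or higher minor version, while embedded `compute_XY` PTX can be JIT-compiled for any equal or higher capability. The helper names and the architecture lists are hypothetical; the actual list built into the wheel used in this thread is not known.

```python
def parse_arch(name):
    """Split 'sm_35' or 'compute_35' into (kind, major, minor), or None if unrecognized."""
    kind, _, digits = name.partition("_")
    if kind not in ("sm", "compute") or not digits.isdigit() or len(digits) < 2:
        return None
    return kind, int(digits[:-1]), int(digits[-1])


def device_is_covered(device_cc, compiled_archs):
    """Return True if a device with capability (major, minor) can run any listed arch."""
    dev_major, dev_minor = device_cc
    for name in compiled_archs:
        parsed = parse_arch(name)
        if parsed is None:
            continue  # skip entries this sketch does not understand
        kind, major, minor = parsed
        # Binary code: same major version, minor not above the device's.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: any capability equal to or below the device's.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False
```

For example, `device_is_covered((3, 0), ["sm_35", "sm_50"])` is `False`, which is the situation the hunch describes, while `device_is_covered((3, 0), ["sm_30"])` is `True`. In recent PyTorch versions, `torch.cuda.get_arch_list()` can supply the second argument.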

rdfong commented 7 years ago

@longcw

Could you tell me which version of CUDA you are using?

Thanks, Roger

rdfong commented 7 years ago

Ended up switching to a different machine, where things seem to work better.