lfz / DSB2017

The solution of team 'grt123' in DSB2017
MIT License

RuntimeError: CUDNN_STATUS_INTERNAL_ERROR #39

Closed MoonBunnyZZZ closed 7 years ago

MoonBunnyZZZ commented 7 years ago

I ran the program as the testing-phase instructions describe, and 'RuntimeError: CUDNN_STATUS_INTERNAL_ERROR' occurred. What makes this happen?

lfz commented 7 years ago

I can't tell you where the problem is from an error message alone.

I believe this is a system problem. You should try running some basic PyTorch programs and check whether the error happens again.

-- 廖方舟 清华大学医学院 Liao Fangzhou School of Medicine Tsinghua University Beijing 100084 China
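The "basic PyTorch program" suggested above could be as small as the following sketch, which runs a single 3D convolution on the GPU, the same kind of operation that a cuDNN error would come from. It is not part of the repository: the function name `cudnn_smoke_test` and the tensor sizes are made up for illustration, and it assumes a reasonably recent PyTorch (the `torch.no_grad()` context does not exist in the 2017 releases discussed in this thread).

```python
def cudnn_smoke_test():
    """Run one small 3D convolution on the GPU and report what happened.

    Hypothetical helper for illustration; returns a human-readable status.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "no CUDA device visible to PyTorch"
    if not torch.backends.cudnn.is_available():
        return "CUDA works, but cuDNN is not available"

    try:
        # Tiny layer and input (sizes chosen arbitrarily) so memory is not a factor.
        conv = torch.nn.Conv3d(1, 4, kernel_size=3, padding=1).cuda()
        x = torch.randn(1, 1, 32, 32, 32).cuda()
        with torch.no_grad():
            y = conv(x)
        # GPU work is asynchronous; synchronizing surfaces errors here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return "conv3d on the GPU failed: {}".format(err)

    return "cuDNN {} conv3d OK, output shape {}".format(
        torch.backends.cudnn.version(), tuple(y.shape))


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

If this tiny convolution succeeds while the full model fails, the problem is more likely related to input size or memory than to a broken CUDA/cuDNN install.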

MoonBunnyZZZ commented 7 years ago

First of all, thank you.

I tried some samples from https://github.com/pytorch/examples: 'mnist', 'reinforcement_learning', and 'word_language_model' (this one uses CUDA). They all work well, with no crashes. I wanted to run 'imagenet' because I think it uses cuDNN (the code has 'import torch.backends.cudnn as cudnn'), but it requires downloading a large dataset. My Internet connection is so slow that I finally gave up on this sample.

Could you suggest a good sample to test my system? I would appreciate your help very much, and I really hope to solve this problem as soon as possible. I am not far from Tsinghua University. No offence, but may I consult you offline? I know you are busy, and I can't ask for your contact information here out of respect for your privacy. So here is my phone number, 15910965967; my WeChat ID is the same.

Waiting for your reply! Thank you again!

MoonBunnyZZZ commented 7 years ago

PS: I also ran cudnn-sample-v5. It worked correctly, so I am now even more confused about what exactly causes that RuntimeError.

lfz commented 7 years ago

At the very least, take a screenshot and tell me which line the error occurs on...

And include the complete error report.

MoonBunnyZZZ commented 7 years ago

screenshot from 2017-08-24 08 59 24

MoonBunnyZZZ commented 7 years ago

A small request, if I may: could you please add me on WeChat? 15910965967. Thank you.

zhaifly commented 7 years ago

Maybe the batch size of 8 is too large.

BR. -zhaifly

On 24 August 2017 at 09:22, MoonBunnyZZZ notifications@github.com wrote:

I ran python main.py in the terminal. The result is as follows:

starting preprocessing
0b8afe447b5f1a2c405f41cf2fb1198e done
end preprocessing
(8L, 1L, 208L, 208L, 208L)
Traceback (most recent call last):
  File "main.py", line 58, in <module>
    test_detect(test_loader, nod_net, get_pbb, bbox_result_path,config1,n_gpu=config_submit['n_gpu'])
  File "/home/ubuntu/nndl/DSB2017/test_detect.py", line 52, in test_detect
    output = net(input,inputcoord)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/nndl/DSB2017/net_detector.py", line 102, in forward
    out = self.preBlock(x)#16
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 252, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 351, in forward
    self.padding, self.dilation, self.groups)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 119, in conv3d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
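To put the "batch size may be too large" suggestion in numbers, here is a back-of-the-envelope sketch of how much memory a tensor of the printed shape (8, 1, 208, 208, 208) takes. It assumes 4-byte float32 elements. The 24-channel figure is an assumption made purely for illustration; the thread does not say how many channels the first convolution produces.

```python
from functools import reduce
from operator import mul

MIB = 1024 ** 2
GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape printed just before the traceback in the thread.
input_shape = (8, 1, 208, 208, 208)

# ASSUMPTION for illustration only: channel count after the first convolution.
ASSUMED_CHANNELS = 24
activation_shape = (8, ASSUMED_CHANNELS, 208, 208, 208)

input_mib = tensor_bytes(input_shape) / float(MIB)
activation_gib = tensor_bytes(activation_shape) / float(GIB)

print("input batch:        {:.3f} MiB".format(input_mib))
print("one activation map: {:.2f} GiB (assumed {} channels)".format(
    activation_gib, ASSUMED_CHANNELS))
```

Under these assumptions the input batch alone is about 275 MiB, and a single 24-channel activation of the same spatial size is about 6.4 GiB, which would exhaust many GPUs before the rest of the network runs. Running fewer crops per forward pass shrinks both numbers proportionally.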

MoonBunnyZZZ commented 7 years ago

screenshot from 2017-08-24 13 22 54

Should the batch size be set here or not? The default value is 1.

MoonBunnyZZZ commented 7 years ago

An idea suddenly struck me: the batch size is a fixed value, because the test phase uses 'detector.ckpt', which is the result of the training phase.

So if I want to give the batch size a new value, I should train a new model. Is that right?
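Nobody in the thread answers this question directly. As a general property of convolutional layers (not something confirmed for this repository), the stored weights have no batch dimension, so a checkpoint does not by itself pin the batch size. The sketch below only does shape arithmetic; the helper names and the layer sizes (1 input channel, 24 output channels, kernel 3, padding 1) are hypothetical.

```python
def conv3d_weight_shape(in_channels, out_channels, kernel_size):
    """Shape of a 3D convolution's weight tensor; note: no batch dimension."""
    return (out_channels, in_channels, kernel_size, kernel_size, kernel_size)


def conv3d_output_shape(batch, out_channels, size, kernel_size, padding):
    """Output shape for a cubic input, assuming stride 1 and dilation 1."""
    out = size + 2 * padding - kernel_size + 1
    return (batch, out_channels, out, out, out)


# Hypothetical layer sizes, chosen only for illustration.
weights = conv3d_weight_shape(in_channels=1, out_channels=24, kernel_size=3)

# The same weights apply to any batch size; only the output shape changes.
for batch in (1, 8):
    print(batch, weights, conv3d_output_shape(batch, 24, 208, 3, 1))
```

The weight shape is identical in both iterations, which is why the same saved weights can normally be applied to a batch of 1 or a batch of 8.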

glhfgg1024 commented 7 years ago

try num_workers_=1

blakeliu commented 6 years ago

Watch Out: '--n_test', default=8
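For readers unfamiliar with how such a default behaves, here is a minimal sketch of an argparse flag named `--n_test` with a default of 8. The parser is a stand-in written for this illustration, not the repository's actual code, and the thread does not say exactly what the flag controls.

```python
import argparse


def build_parser():
    """Stand-in parser; only the flag name and default come from the thread."""
    parser = argparse.ArgumentParser(description="illustrative parser")
    parser.add_argument("--n_test", type=int, default=8,
                        help="hypothetical help text: test-time setting")
    return parser


# With no flag given, the default of 8 silently applies.
default_args = build_parser().parse_args([])

# Passing the flag explicitly overrides the default.
override_args = build_parser().parse_args(["--n_test", "1"])

print(default_args.n_test, override_args.n_test)
```

The point of the warning is that a value of 8 takes effect even when the user never typed it, so it has to be overridden explicitly on the command line.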