pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

No detections with CUDNN=1 and tiny-yolo #405

Open jacobsuh opened 6 years ago

jacobsuh commented 6 years ago

When I had CUDNN=0 and GPU=1, both yolo.weights and tiny-yolo.weights worked fine, but after recompiling with CUDNN=1 and GPU=1, tiny-yolo.weights no longer produces detections (even with a very low threshold). Strangely enough, the normal yolo.weights still works. Any idea why this could be?

Also, is the only difference between the two setups that the latter uses the cuDNN library on top of CUDA? How much of a performance improvement would there realistically be between the two?
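
For reference, here is a minimal sketch of how the two builds differ, assuming the stock Makefile where the GPU and CUDNN flags default to 0 and can be overridden on the make command line:

# CUDA-only build: convolutions use darknet's own GPU kernels
make clean && make GPU=1 CUDNN=0

# CUDA + cuDNN build: convolutional layers are dispatched to cuDNN routines instead
make clean && make GPU=1 CUDNN=1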

ivpusic commented 6 years ago

I have the same issue

ahsan856jalal commented 6 years ago

You can try one thing: with CUDNN=1, GPU=1, OPENCV=1, train one small model with one class and ~1000 images for around 2k iterations, then test detection with the same configuration.
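
A rough sketch of that training-then-testing sequence (the data file, cfg, and weight names here are just placeholders for your own files):

# Train a small one-class model for ~2k iterations (stop it once it gets there)
./darknet detector train data/obj.data cfg/tiny-yolo-obj.cfg darknet19_448.conv.23
# Test detection with the same GPU=1, CUDNN=1 build on the latest checkpoint
./darknet detector test data/obj.data cfg/tiny-yolo-obj.cfg backup/tiny-yolo-obj.backup data/test.jpg -thresh 0.2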

ahsan856jalal commented 6 years ago

I believe you are using this sort of command in the first place to test the image: ./darknet detector test data/coco.data cfg/yolo.cfg yolo.weights data/dog.jpg -i 0 -thresh 0.2

kausb commented 6 years ago

Hi Jacob/ivpusic, I am facing a similar issue with custom-trained weights and tiny YOLO (CUDNN=1, GPU=1, OPENCV=0). With CUDNN=0, GPU=1, inference works fine. Have you been able to debug/resolve this issue?

barkermlpg commented 6 years ago

Very interesting post, and I would also like to know how to resolve this issue and help if possible. I'm using YOLOv2 and ran into the same issue on one of two systems.

What other system information would be useful to debug the issue? Please share, thanks.

chiefkarlin commented 6 years ago

Also running into this issue on a machine with a Tesla K80, Ubuntu 16.04, CUDA 9.1, and cuDNN 7.1.2, using a custom-trained model.

Compiling with CUDNN=0, detections work fine.

cometyang commented 6 years ago

I had a similar issue on a P100, Ubuntu 16.04, CUDA 8.0, cuDNN 7.0.5 when running the example script ./darknet classifier predict cfg/imagenet1k.data cfg/extraction.cfg extraction.weights data/dog.jpg. Both GPU=0, CUDNN=0 and GPU=1, CUDNN=0 work, but GPU=1, CUDNN=1 fails. Output:

Loading weights from extraction.weights...Done!
data/dog.jpg: Predicted in 0.003717 seconds.
0.41%: bucket
0.39%: hook
0.39%: tennis ball
0.35%: paper towel
0.35%: water bottle

TetsuakiBaba commented 6 years ago

Hi, I got the same issue with both the classifier and detector options. I resolved it by editing the cfg file.

I got the same issue by typing the command below: ./darknet classifier predict cfg/imagenet1k.data cfg/extraction.cfg extraction.weights data/dog.jpg

But after changing the batch and subdivisions parameters in extraction.cfg, I got a correct recognition result.

I think that whenever we predict, test, or demo with darknet on the GPU, we have to make sure the cfg file is in test mode (i.e. batch=1, subdivisions=1). The default setting of extraction.cfg is batch=128, subdivisions=8, which are training-mode settings.

In any case, it runs correctly in CPU mode, but on the GPU we have to change batch and subdivisions. I hope this helps.
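
One way to apply this without touching the original cfg is to keep a separate test-mode copy (the copy's filename below is just an example):

# Make a test-mode copy of the cfg with batch=1 and subdivisions=1
cp cfg/extraction.cfg cfg/extraction-test.cfg
sed -i 's/^batch=.*/batch=1/; s/^subdivisions=.*/subdivisions=1/' cfg/extraction-test.cfg
# Predict using the test-mode copy
./darknet classifier predict cfg/imagenet1k.data cfg/extraction-test.cfg extraction.weights data/dog.jpg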

barkermlpg commented 6 years ago

That is it! I just confirmed on my system. This solves it, thank you!

fspeed commented 6 years ago

I had a similar issue on a Titan X, Ubuntu 18.04, CUDA 9.0, cuDNN 7.1 when running the example script ./darknet classifier predict cfg/imagenet1k.data cfg/extraction.cfg extraction.weights data/eagle.jpg. Both GPU=0, CUDNN=0 and GPU=1, CUDNN=0 work, but GPU=0, CUDNN=1 and GPU=1, CUDNN=1 both fail. Output:

Loading weights from extraction.weights...Done!
data/dog.jpg: Predicted in 0.004810 seconds.
0.41%: bucket
0.39%: hook
0.39%: tennis ball
0.35%: paper towel
0.35%: water bottle

jerinka commented 6 years ago

I tried Tiny Darknet for classification and got an error when I set CUDNN=1; it works fine when CUDNN=0. I am using CUDA 8 and cuDNN 6. I tried setting batch=1 and subdivisions=1, but the result is still wrong (it always shows the same values).

sharowyeh commented 6 years ago

darknet pre-allocates GPU memory for each layer when GPU=1 or CUDNN=1, depending on the batch and subdivisions settings in the cfg file. For training, batch indicates how many pictures are fed to the GPU per iteration; a larger value can reduce training time, and subdivisions splits the batch into groups to avoid running out of GPU memory within an iteration. Keep both at 1 for detection, because these settings are mainly for training the network.
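
If you want to see that pre-allocation for yourself, one rough way (assuming nvidia-smi is available) is to log GPU memory while the network loads under different batch/subdivisions settings:

# Log GPU memory usage once per second while darknet allocates the network
nvidia-smi --query-gpu=memory.used --format=csv -l 1 > gpu_mem.log &
./darknet classifier predict cfg/imagenet1k.data cfg/extraction.cfg extraction.weights data/dog.jpg
kill $!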

lintangsutawika commented 6 years ago

I had this same issue. A quick fix is to copy your current cfg file, comment out the training batch and subdivisions, and uncomment the testing batch and subdivisions. That way, you have two cfg files that differ only there.

# Testing
batch=1
subdivisions=1
# Training
#batch=256
#subdivisions=64

braddockcg commented 5 years ago

Same issue here. Setting batch=1, subdivisions=1 worked for doing detections, thanks! Does this mean I can't train with cuDNN?

kuriel07 commented 5 years ago

I had the same issue; changing cuDNN to an older version and rebuilding the project solved it.

zkailinzhang commented 5 years ago

With OPENCV=1, GPU=1, and CUDNN=1, I get the same error. Please modify cfg/yolov3.cfg:

# Testing
batch=1
subdivisions=1
# Training
#batch=256
#subdivisions=64

Then the detector results display correctly.