hank-ai / darknet

Darknet/YOLO object detection framework
https://darknetcv.ai/
Apache License 2.0

darknet crashes when calculating mAP% at iteration #1000 #2

Open stephanecharette opened 12 months ago

stephanecharette commented 12 months ago

User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.

The exact message they're seeing is:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #546
* Error message:  cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (13 entries):
1/13: ./darknet(log_backtrace+0x38) [0x560b3fb79128]
2/13: ./darknet(darknet_fatal_error+0x19d) [0x560b3fb7936d]
3/13: ./darknet(cudnn_check_error_extended+0x83) [0x560b3fb7bf83]
4/13: ./darknet(forward_convolutional_layer_gpu+0x2c5) [0x560b3fc56985]
5/13: ./darknet(forward_network_gpu+0xe1) [0x560b3fc6af81]
6/13: ./darknet(network_predict_gpu+0x140) [0x560b3fc6d800]
7/13: ./darknet(validate_detector_map+0xa49) [0x560b3fc02f29]
8/13: ./darknet(train_detector+0x1ce0) [0x560b3fc05f70]
9/13: ./darknet(run_detector+0x9f6) [0x560b3fc09996]
10/13: ./darknet(main+0x4b3) [0x560b3fb308b3]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ed5bd7d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ed5bd7e40]
13/13: ./darknet(_start+0x25) [0x560b3fb32b25]
Segmentation fault (core dumped)
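
One quick sanity check when this happens is to confirm which cuDNN library the binary actually loads at run time, since machines often have more than one version installed. A minimal check, assuming the ./darknet binary shown in the backtrace above is in the current directory:

```bash
# Show which libcudnn shared object the darknet binary resolves to at run time.
ldd ./darknet | grep -i cudnn
```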
stephanecharette commented 12 months ago

This is a continuation of https://github.com/AlexeyAB/darknet/issues/8669

chrislytras commented 12 months ago

Using:

libcudnn8=8.5.0.96-1+cuda11.7 libcudnn8-dev=8.5.0.96-1+cuda11.7

The crash was also reproduced with 8.9.3.28-1+cuda11.8.
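
For anyone comparing setups, one way to capture the cuDNN, CUDA, and driver versions in a report like this (assuming dpkg, nvcc, and nvidia-smi are available on the PATH):

```bash
# Installed cuDNN packages (libcudnn8 / libcudnn8-dev as named above).
dpkg -l | grep -i libcudnn

# CUDA toolkit and GPU driver versions.
nvcc --version
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```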

sinyb commented 10 months ago

Me too:

Ubuntu 22.04.3

libcudnn8=8.9.4.25-1+cuda12.2 libcudnn8-dev=8.9.4.25-1+cuda12.2

... -> next mAP calculation will be at iteration #1000
Tensor Cores are disabled until iteration #3000.
1000: loss=4.558, avg loss=4.317, rate=0.001000, 103.801 milliseconds, 32000 images, time remaining=30 hours

calculating mAP (mean average precision)...
Detection layer #30 is type 28 (yolo)
Detection layer #37 is type 28 (yolo)
using 4 threads to load 420 validation images for mAP% calculations
processing #0 (0%)
cuDNN status error in /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554


kdill00 commented 10 months ago

You probably know this, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and quadruples training time. Thanks for working on this; it has probably been the biggest pain in the ass with darknet for the last two years. I gave up and wrote bash scripts to stop training, run the map command, post the result online, and start training again (a rough sketch is below). It would be nice to get in-training mAP working reliably.
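
Roughly, the loop looks like this. It is only a sketch: the file names (cars.data, cars.cfg, the backup/ folder) are placeholders, and it assumes the usual `darknet detector train` / `darknet detector map` commands and that darknet keeps writing a `*_last.weights` checkpoint into the backup folder named in the .data file.

```bash
#!/usr/bin/env bash
# Sketch only: train WITHOUT the in-training -map flag, then periodically
# stop, measure mAP% with a separate "darknet detector map" run, and resume.
set -euo pipefail

DATA=cars.data                    # placeholder .data file
CFG=cars.cfg                      # placeholder .cfg file
LAST=backup/cars_last.weights     # checkpoint darknet keeps updating

while true; do
    WEIGHTS=""
    if [ -f "$LAST" ]; then
        WEIGHTS="$LAST"           # resume from the latest checkpoint
    fi

    # Train for a while; darknet keeps saving *_last.weights as it goes,
    # and timeout stops it so mAP can be measured out-of-process.
    # $WEIGHTS is deliberately unquoted so an empty value expands to nothing.
    timeout 2h darknet detector train "$DATA" "$CFG" $WEIGHTS -dont_show || true

    # Measure mAP% in a separate process and append the output to a log.
    darknet detector map "$DATA" "$CFG" "$LAST" | tee -a map_log.txt
done
```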

kdill00 commented 10 months ago

Just to note, I have hit this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, it increases training time too much.

chrislytras commented 10 months ago

> Just to note, I have hit this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, it increases training time too much.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb

Downgrading cuDNN to 8.4.1.50 (the CUDA 11.6 build) should do it.
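
Assuming the two .deb files above downloaded cleanly, installing and pinning them looks roughly like this (the hold step keeps apt from upgrading cuDNN again later):

```bash
# Install the pinned cuDNN 8.4.1.50 packages downloaded above,
# then hold them so apt upgrades leave them alone.
sudo dpkg -i libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb \
             libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb
sudo apt-mark hold libcudnn8 libcudnn8-dev
```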

kdill00 commented 9 months ago

I'll give it a try, thank you.

suminoshi commented 9 months ago

If this error occurs, check burn_in=1000 in the [net] section of the config.

If you set this value to 800, the same error occurs at iteration #800, and setting it to 100 gives the same result.

However, if you set subdivisions to a value that is not a power of two, such as 6 or 10, the error does not occur. I think it's a problem with the result of an internal multiplication or division. Whether the error appears at the burn-in point may also depend on the number of training files or other factors.
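
For anyone who wants to try either subdivisions workaround from this thread, it is a one-line edit in the [net] section of the .cfg. A hedged example using sed (my_network.cfg is a placeholder file name):

```bash
# Raise subdivisions to 64 (the earlier workaround), or substitute a
# non-power-of-two value such as 6 or 10 as suggested above.
sed -i 's/^subdivisions=.*/subdivisions=64/' my_network.cfg

# Confirm the values darknet will read from the [net] section.
grep -E '^(batch|subdivisions|burn_in)=' my_network.cfg
```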