Open stephanecharette opened 12 months ago
This is a continuation of https://github.com/AlexeyAB/darknet/issues/8669
Using:
libcudnn8=8.5.0.96-1+cuda11.7 libcudnn8-dev=8.5.0.96-1+cuda11.7
But also recreated using 8.9.3.28-1+cuda11.8.
Me too:
Ubuntu 22.04.3
libcudnn8=8.9.4.25-1+cuda12.2 libcudnn8-dev=8.9.4.25-1+cuda12.2
```
... -> next mAP calculation will be at iteration #1000
Tensor Cores are disabled until iteration #3000.
1000: loss=4.558, avg loss=4.317, rate=0.001000, 103.801 milliseconds, 32000 images, time remaining=30 hours

calculating mAP (mean average precision)...
Detection layer #30 is type 28 (yolo)
Detection layer #37 is type 28 (yolo)
using 4 threads to load 420 validation images for mAP% calculations
processing #0 (0%)
cuDNN status error in /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554

backtrace (13 entries):
1/13: darknet(_Z13log_backtracev+0x38) [0x55b121550ce8]
2/13: darknet(darknet_fatal_error+0x1bd) [0x55b121550f4d]
3/13: darknet(cudnn_check_error_extended+0x83) [0x55b1214982b3]
4/13: darknet(forward_convolutional_layer_gpu+0x2d5) [0x55b12148bce5]
5/13: darknet(forward_network_gpu+0xe1) [0x55b12152b9d1]
6/13: darknet(network_predict_gpu+0x140) [0x55b12152e660]
7/13: darknet(validate_detector_map+0xa06) [0x55b1214afa56]
8/13: darknet(train_detector+0x1475) [0x55b1214b2185]
9/13: darknet(_Z12run_detectoriPPc+0xa85) [0x55b1214b60f5]
10/13: darknet(main+0x4a1) [0x55b1214454e1]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6dd2e29d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6dd2e29e40]
13/13: darknet(_start+0x25) [0x55b121447ef5]
Command exited with non-zero status 1
```
You probably know this already, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and roughly quadruples training time. Thanks for working on this; it's probably been the biggest pain point with darknet for the last two years. I gave up and wrote bash scripts to stop training, run mAP, post the results online, and resume training. It would be nice to get in-training mAP working reliably.
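For reference, the stop/mAP/resume workaround described above can be sketched roughly as follows. All paths, file names, and the one-hour interval are placeholders (assumptions), not from this thread; `darknet detector train` and `darknet detector map` are the standard darknet subcommands.

```shell
#!/usr/bin/env bash
# Sketch of the "stop, run mAP externally, resume" workaround.
# DATA/CFG/WEIGHTS paths and the 1h interval are placeholders.
DATA=cfg/project.data
CFG=cfg/project.cfg
WEIGHTS=backup/project_last.weights

while true; do
    # Train for a while; darknet periodically saves *_last.weights to backup/.
    # Using timeout to interrupt training is a blunt instrument, but it avoids
    # the in-training mAP code path that triggers the cuDNN crash.
    timeout 1h darknet detector train "$DATA" "$CFG" "$WEIGHTS" -dont_show

    # Run the mAP calculation in a separate process instead of in-training.
    darknet detector map "$DATA" "$CFG" "$WEIGHTS"
done
```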
Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, it increases training time too much.
```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb
```
This should do it.
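Assuming the two .deb files downloaded above, installing and pinning the downgrade might look like this (a sketch; not verified on every setup):

```shell
# Install the downloaded packages: runtime library first, then the -dev package.
sudo dpkg -i libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
sudo dpkg -i libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb

# Hold the packages so a later `apt upgrade` does not pull a newer cuDNN back in.
sudo apt-mark hold libcudnn8 libcudnn8-dev
```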
I'll give it a try, thank you.
If this error occurs, check the config: [net] burn_in=1000.
If you set this value to 800, a similar error occurs around iteration #800; setting it to 100 gives the same result.
But if you set subdivisions to a non-power-of-two value such as 6 or 10, the error does not occur. I think it's a problem with the result of an internal multiplication or division. The burn-in behavior may point to an error related to the number of training files or other factors.
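One way to see why non-power-of-two subdivisions might behave differently: darknet derives the mini-batch size as batch divided by subdivisions, and with the usual batch=64 (an assumption here; check your own .cfg), power-of-two values divide evenly while 6 or 10 leave a remainder. A quick sanity check:

```python
# Sanity check: darknet uses mini_batch = batch / subdivisions.
# With batch=64 (typical [net] setting, assumed here), power-of-two
# subdivisions divide evenly, while 6 or 10 leave a remainder --
# which could plausibly change which code path is exercised.
batch = 64  # assumed typical value, not from this thread

for subdivisions in (2, 4, 6, 8, 10, 16, 32, 64):
    mini_batch = batch // subdivisions   # integer division, as in C
    remainder = batch % subdivisions     # nonzero => batch not evenly split
    print(f"subdivisions={subdivisions:2d}  "
          f"mini_batch={mini_batch:2d}  remainder={remainder}")
```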
User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.
The exact message they're seeing is: