LOSS is zero when training on RTX4060

xxtkidxx commented 1 month ago


I changed the make file to SET (DARKNET_CUDA_ARCHITECTURES "75;80;86;89") so that Darknet can run on the RTX4060, but after about 300 iterations the loss is always 0. The same training on an RTX3060 still works fine. How can I fix it? Thank you. @stephanecharette

xxtkidxx commented 1 month ago

@stephanecharette Do you have any suggestions for me?

stephanecharette commented 1 month ago

Lots of questions:

What dataset?
What configuration?
What modifications have you made?
Why are you changing "DARKNET_CUDA_ARCHITECTURES"?
What command did you run?
Why have you not used DarkMark?
Why not use the discord server to get help instead of opening an issue?
What is the output of darknet --version?

xxtkidxx commented 1 month ago

What dataset?
- I use a custom dataset. It still trains fine on RTX3000 and RTX2000 series GPUs.
What configuration?
- I don't think the config file is the problem; the problem only occurs on the RTX4000 series GPU.
What modifications have you made? Why are you changing "DARKNET_CUDA_ARCHITECTURES"?
- I changed the make file to SET (DARKNET_CUDA_ARCHITECTURES "75;80;86;89") because with SET (DARKNET_CUDA_ARCHITECTURES "native") training fails with an error like the one in the image below.
What command did you run? Why have you not used DarkMark? Why not use the discord server to get help instead of opening an issue?
- I do not understand the question. Training ran fine on the RTX3060 GPU, but on the RTX4060 GPU I encountered the problem above. My config and dataset are good.
What is the output of darknet --version?
- Darknet v2.0-196-ga6c3224e-dirty
- Training on the RTX2070 Super and RTX3060 is still fine. Detection is OK.

Training on RTX4060 after about 300 iterations, the Loss is always 0 on the same dataset and config file
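
For reference, the entries in DARKNET_CUDA_ARCHITECTURES are CUDA compute capabilities with the dot dropped: 75 matches the RTX2000 series (7.5), 86 the RTX3000 series (8.6), and 89 the RTX4000 series (8.9). The Python sketch below only illustrates that mapping. The function name arch_list_covers is hypothetical and is not part of Darknet or CMake, and it checks for an exact match only.

```python
def arch_list_covers(arch_list: str, compute_capability: str) -> bool:
    """Return True if a semicolon-separated CMake architecture list
    (e.g. "75;80;86;89") names the given compute capability (e.g. "8.9").

    Hypothetical helper for illustration; not a Darknet or CMake API.
    """
    wanted = compute_capability.strip().replace(".", "")
    # CMake also accepts suffixes such as "89-real" or "89-virtual";
    # drop the suffix before comparing.
    entries = [e.strip().split("-")[0] for e in arch_list.split(";") if e.strip()]
    return wanted in entries


print(arch_list_covers("75;80;86;89", "8.9"))  # True: the RTX4060 is covered
print(arch_list_covers("75;80;86", "8.9"))     # False: no entry for 8.9
```

By this check the list "75;80;86;89" already names the RTX4060's architecture, so the list itself may not be the cause of the zero loss.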

@stephanecharette

Denizzje commented 1 month ago

I have a RTX4080 Super and I do not need to change anything in the "DARKNET_CUDA_ARCHITECTURES" to make it work. I think that is what is going wrong.

Is your CUDA/CUDNN up to date? I use CUDA 12.4 and CUDNN 8.9.7.

xxtkidxx commented 1 month ago

I have a RTX4080 Super and I do not need to change anything in the "DARKNET_CUDA_ARCHITECTURES" to make it work. I think that is what is going wrong.

Is your CUDA/CUDNN up to date? I use CUDA 12.4 and CUDNN 8.9.7.

I use CUDA 12.4 and CUDNN 9.0.0. I don't have an RTX4080. Please send me your Darknet.exe files and I will check on my side. @Denizzje

Denizzje commented 1 month ago

Sending my darknet files isn't useful; Darknet needs to be built to your specification.

OK, I thought you might be using a CUDA version too old for the RTX4000 series, but your CUDA should be good. To make extra sure, because it is the latest CUDA version, you also need to ensure you are on the latest nvidia-driver.

Run nvidia-smi in terminal.

If it does NOT show CUDA Version 12.4, your driver does not support CUDA 12.4 and you need to update your driver.
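
The same check can be scripted. This is a minimal Python sketch, assuming the nvidia-smi header contains a field of the form "CUDA Version: X.Y". The function names are hypothetical. It also assumes that a reported version newer than 12.4 is acceptable, since newer drivers generally still run older CUDA toolkits.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output: str):
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field.

    Returns None if the field is not present in the text.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def driver_supports(smi_output: str, required=(12, 4)) -> bool:
    """True if the driver reports at least the required CUDA version.

    Assumption: a newer reported version also counts as supported.
    """
    found = parse_cuda_version(smi_output)
    return found is not None and found >= required


# Only query the real tool when it is installed on this machine.
if shutil.which("nvidia-smi"):
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print("driver CUDA version:", parse_cuda_version(out))
    print("supports CUDA 12.4:", driver_supports(out))
```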

Regarding CUDNN, personally I use 8.9.7 because I cannot get 9.xx to work yet. I am not 100% sure, but maybe setting DARKNET_CUDA_ARCHITECTURES forced a build through while something is not properly installed. That in turn could leave darknet unable to calculate properly, resulting in the 0 loss you are experiencing. You could consider removing CUDNN 9 and building with CUDNN 8.9.7, which you can get from here: https://developer.nvidia.com/rdp/cudnn-archive .

With this setup everything works for me without changing anything, building by following the CMake Windows steps at https://github.com/hank-ai/darknet?tab=readme-ov-file#windows-cmake-method
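
If more than one CUDNN version has been installed, it may help to confirm which one the build actually sees. CUDNN 8 and later ship a header named cudnn_version.h that defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL. The Python sketch below parses text in that format. The function name is hypothetical and the sample text is illustrative; read the header from wherever CUDNN is installed on your machine.

```python
import re


def parse_cudnn_version(header_text: str):
    """Return (major, minor, patch) from cudnn_version.h-style text.

    Returns None if any of the three defines is missing.
    """
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#\s*define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


# Illustrative sample in the header's format, not copied from a real install.
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
"""
print(parse_cudnn_version(sample))  # (8, 9, 7)
```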

hank-ai / darknet

LOSS is zero when training on RTX4060 #61