tahakhursheed opened this issue 6 years ago
I have (had?) the same problem. The issue seems to be the batch normalization step, which does not work well with very small mini-batches (see: https://github.com/pjreddie/darknet/issues/715). In my case I started out with batch_size=8 and subdivisions=2, which, if I'm not mistaken, means a mini-batch size of 4 (8/2 == 4). Increasing batch_size to 16 while keeping subdivisions at 2 helped, but I'm still getting quite a few NaNs. The author has stated in another issue that a couple of NaNs are fine, but I feel like there are too many of them. I'm training right now and I'll see tonight/over the next days whether the model is learning anything useful at all. A sketch of the relevant config entries follows below.
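For reference, these parameters live in the [net] section of the darknet .cfg file (in the stock yolov3.cfg the key is batch rather than batch_size). The snippet below is only an illustrative sketch of the settings discussed above, not anyone's exact file:

```
# Sketch of the relevant [net] entries in a yolov3-style .cfg
# (assumed values for illustration only).
[net]
# batch images are loaded per iteration and processed in chunks of
# batch/subdivisions images; here 16/2 = 8 images per mini-batch.
batch=16
subdivisions=2
width=416
height=416
channels=3
```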
The memory usage is also tied to these parameters: the bigger the mini-batches, the more GPU RAM is used. With my former settings it used just above 1 GB of RAM, while with my current settings it uses around 2-3 GB, with spikes up to 3.5 GB. In my case I want to keep memory usage strictly below 3 GB because I will eventually have to run this on a GTX 780, so I'm trying to find a set of parameters that avoids out-of-memory errors but is still able to learn something (my former parameters did not learn anything at all).
You can increase batch_size or decrease subdivisions; then memory usage increases and the NaNs also decrease.
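As a rough illustration of that trade-off (assumed values, not taken from anyone's actual config): the two [net] variants below see the same number of images per weight update, since batch is the same, but differ in mini-batch size, GPU memory use, and NaN risk.

```
# Two alternative [net] variants of the same .cfg, for comparison only.

# Variant A: smaller mini-batch (64/16 = 4 images processed at once)
# -> lower GPU memory use, but noisier batch-norm statistics and a
#    higher chance of NaNs during training.
[net]
batch=64
subdivisions=16

# Variant B: larger mini-batch (64/8 = 8 images processed at once)
# -> roughly double the GPU memory per forward/backward pass, but more
#    stable batch norm and usually fewer NaNs.
[net]
batch=64
subdivisions=8
```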
I am trying to train YOLO V3 on a custom dataset to detect a single object.
After 7 hours of training on an NVIDIA GTX 1080 (8 GB RAM), this is what I get; the training output looks something like the screenshot above.
This is the output of nvidia-smi:
My current configuration is shown above. I selected it after multiple rounds of trial and error and repeatedly getting the CUDA out-of-memory error message.
PS: Pardon my ignorance, I am a complete beginner.