pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

Multiple NaNs, Improving training speed and other Issues. #1176

Open tahakhursheed opened 6 years ago

tahakhursheed commented 6 years ago

I am trying to train YOLO V3 on a custom dataset to detect a single object.

After 7 hours of training on an NVIDIA GTX 1080 (8 GB VRAM), this is what I get:

Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000012, .5R: -nan, .75R: -nan,  count: 0
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000002, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000007, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: 0.705428, Class: 0.998099, Obj: 0.387307, No Obj: 0.000543, .5R: 1.000000, .75R: 0.250000,  count: 12
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000004, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000008, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: 0.646503, Class: 0.998807, Obj: 0.719028, No Obj: 0.002933, .5R: 0.812500, .75R: 0.203125,  count: 64
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000013, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.667474, Class: 0.999867, Obj: 0.596711, No Obj: 0.000445, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: 0.795498, Class: 0.997384, Obj: 0.507768, No Obj: 0.000523, .5R: 1.000000, .75R: 0.833333,  count: 6
5118: 3.544026, 3.641259 avg, 0.001000 rate, 6.250511 seconds, 163776 images

The output looks something like the above.

This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   89C    P2    83W /  N/A |   2381MiB /  8111MiB |     80%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1108      G   /usr/lib/xorg/Xorg                           183MiB |
|    0      1718      G   compiz                                       158MiB |
|    0      8157      C   ./darknet                                   2035MiB |
+-----------------------------------------------------------------------------+

My current settings are

 batch=32
 subdivisions=16

I arrived at these after several rounds of trial and error, backing off whenever I hit CUDA out-of-memory errors.

  1. Why do I get all the NaNs even after 7 hours or more of training?
  2. Is it okay to train for a single object with YOLOv3, or should Tiny YOLO be preferred?
  3. Why is GPU memory usage during training limited to only around 2 GB? When I trained Faster R-CNN in TensorFlow, GPU memory usage was around 7.5 GB.
  4. How can I improve the training speed?

PS: Pardon my ignorance; I am a complete beginner.

Bakuriu commented 6 years ago

I have (had?) the same problem. The issue seems to be the batch normalization step, which does not work well with mini-batches that are too small (see: https://github.com/pjreddie/darknet/issues/715). In my case I was initially using batch_size=8 and subdivisions=2, which, if I'm not mistaken, means a mini-batch size of 4 (8/2 == 4).
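
For reference, those two values live in the [net] section of the .cfg; a minimal sketch of my old settings (illustrative only, the rest of the section omitted):

 [net]
 # darknet splits each batch into `subdivisions` chunks, so the
 # mini-batch the network actually sees is batch / subdivisions
 batch=8         # images per weight update
 subdivisions=2  # mini-batch = 8 / 2 = 4 images per forward/backward pass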

Increasing batch_size to 16 while keeping subdivisions at 2 helped, but I'm still getting quite a few NaNs. The author has stated in another issue that a couple of NaNs are fine, but these feel like too many. Right now I'm training, and I'll see tonight or over the next few days whether the model is learning anything useful at all.

Memory usage is also tied to these parameters: the bigger the mini-batches, the more GPU RAM is used. With my former settings it used just over 1 GB of RAM, while with my current settings it uses around 2-3 GB, with spikes up to 3.5 GB. (In my case I want to keep memory usage strictly below 3 GB because I will eventually have to run this on a GTX 780, so I'm trying to find a set of parameters that avoids memory errors but can still learn something. My former parameters did not learn anything at all.)

fengxiuyaun commented 6 years ago

You can increase batch_size or decrease subdivisions; then memory usage increases and the NaNs also decrease.
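
For example, starting from batch=32 and subdivisions=16 (an effective mini-batch of 2), something like the following roughly quadruples the mini-batch size. The exact values that fit in 8 GB depend on the input resolution, so treat this only as a sketch:

 [net]
 # mini-batch = batch / subdivisions = 64 / 8 = 8 images per pass
 batch=64
 subdivisions=8  # lower this further (or raise batch) only if memory allows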