pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/
25.82k stars 21.33k forks

Avg loss suddenly starts increasing #489

Open bunnyUpRoar opened 6 years ago

bunnyUpRoar commented 6 years ago

Hello, I am training YOLO with 2 classes and about 900 images. The guide below mentions that the avg loss should decrease to around 0.060730 and then stop decreasing. However, my avg loss reaches a low point of about 0.75, then suddenly jumps to around 400 before coming back down. After the spike it takes much longer to decrease, and it usually oscillates around 1.5.

This is the guide I am talking about https://github.com/unsky/yolo-for-windows-v2/blob/master/README.md

Here is a graph of my average loss per iteration; the Y axis is the average loss and the X axis is the number of iterations.

https://i.imgur.com/j7XURBV.png

My question is whether this is expected or means that there is something wrong with my dataset/configuration. I know that YOLO augments the images, so it is possible that when the loss reaches this low a new augmentation may bring it all the way up. However, this is just my speculation.

Note: I am using ubuntu 16.04, with 2 titan xp video cards.

Many thanks.

sivagnanamn commented 6 years ago

What are your learning rate, steps & scales settings? An increase in the learning rate can cause such a spike in the loss. Please share your cfg file so we can understand the problem better.

bunnyUpRoar commented 6 years ago

Sure:

[net]
batch=64
subdivisions=32
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0001
burn_in=1000
max_batches = 80200
policy=steps
steps=9000
scales=.1,.1

...

[region]
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
bias_match=1
classes=2
coords=4
num=5
softmax=1
jitter=.3
rescore=1
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1
absolute=1
thresh = .6
random=1

I omitted the configuration for the layers since it is the same as in yolo.cfg.

Looking at what you said, the spike does occur when the learning rate is scaled at iteration 9000. But the scale is 0.1, which should decrease the learning rate, not increase it.

sivagnanamn commented 6 years ago
learning_rate=0.0001
burn_in=1000
max_batches = 80200
policy=steps
steps=9000
scales=.1,.1

This shows that at the 9001st batch your learning rate changes from 0.0001 to 0.0001 x 0.1 = 0.00001. You can confirm this in your training logs at the 9001st batch. In the loss graph it is evident that there is a spike in the loss at approximately 9000 batches.
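For reference, the policy=steps schedule described above (including the burn_in ramp-up) can be sketched roughly as below. This is an illustration of the schedule under the cfg values from this thread, not Darknet's actual implementation; the power=4 warm-up exponent is assumed from Darknet's default.

```python
def learning_rate(batch, base_lr=0.0001, burn_in=1000,
                  steps=(9000,), scales=(0.1,), power=4):
    """Rough sketch of a policy=steps learning-rate schedule with burn-in."""
    if burn_in and batch < burn_in:
        # Warm-up: LR ramps from ~0 up to base_lr over the first burn_in batches.
        return base_lr * (batch / burn_in) ** power
    lr = base_lr
    for step, scale in zip(steps, scales):
        if batch >= step:
            lr *= scale  # each step that has been passed multiplies the LR by its scale
    return lr

# With the cfg above: still 0.0001 at batch 8999, dropped to 0.00001 at batch 9001.
```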

If you're using multi GPU for training, please do as below (suggested by @pjreddie ):

When training YOLOv2 with multigpu it's a good idea to train it first on 1 gpu for like 1000 iterations, then switch to multigpu, training is just more stable that way. and yeah, the flag is -gpus 0,1,2,... it doesn't work well if you use > 4 gpus, i usually use 4.

https://groups.google.com/forum/#!msg/darknet/NbJqonJBTSY/Te5PfIpuCAAJ

Train your model with 1 GPU for the first few thousand iterations, then use the latest weights to initialize the network again and continue training with the multi-GPU setup.
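As a concrete sketch of that two-phase recipe, the command lines could be assembled as below. The data/cfg/weights file names are illustrative placeholders, not taken from this thread.

```python
def darknet_train_cmd(data, cfg, weights, gpus):
    """Assemble a darknet training command line for the given list of GPU ids."""
    return ["./darknet", "detector", "train", data, cfg, weights,
            "-gpus", ",".join(str(g) for g in gpus)]

# Phase 1: first ~1000 iterations on a single GPU.
phase1 = darknet_train_cmd("data/obj.data", "cfg/yolo-obj.cfg",
                           "darknet19_448.conv.23", [0])

# Phase 2: re-initialize from the latest checkpoint and continue on both GPUs.
phase2 = darknet_train_cmd("data/obj.data", "cfg/yolo-obj.cfg",
                           "backup/yolo-obj.backup", [0, 1])
```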

It also looks like you've modified the steps & scales parameters; I would recommend trying the default values from yolo.cfg in the Darknet repo.

bunnyUpRoar commented 6 years ago

Okay, I thought the reason to train with one GPU for about 1000 iterations was that it otherwise wouldn't back up the weights, due to some sync issue. Anyway, I'll do what you suggest and also reset the steps and scales parameters, and see if I get better results.

Thanks.

Isha8 commented 6 years ago

@bunnyUpRoar Did you get any better results with the avg loss? If yes, what do you think was causing it? For me as well, the avg loss is oscillating around 0.14 after 1500 iterations on about 300 images with 2 classes. Thanks

Rahul-Venugopal commented 6 years ago

Hi @bunnyUpRoar,

I am trying to change the number of iterations on the x-axis of the avg loss plot (for me the x-axis reads 0, 50200, 100400, ...). Can anyone please tell me where I can change it?

The avg loss plot window also says that pressing s will save the graph as Chart.jpg, but unfortunately I am not able to save it either. Any suggestions?

Thanks Rahul

pumdanny commented 5 years ago

Hi @Rahul-Venugopal,

In the file cfg/[your_file_name].cfg you have to change:

max_batches = 500200 --> max_batches = [your max number of iterations], for example max_batches = 10500

steps = 400000,450000 --> steps = [your value], [your value + 50000], for example steps = 10000, 10500

Rahul-Venugopal commented 5 years ago

Hi @pumdanny @AlexeyAB ,

Thanks for your reply @pumdanny. Actually, I do not want to change the max number of iterations or the steps; I am only trying to change the x-axis of the avg-loss plot window. I would like to know whether I can change the x-axis interval without changing the max number of iterations. Currently I have the following settings:

max_batches = 500200
steps = 400000,450000

and the x-axis of my avg-loss graph is

50020, 100040, 150060, ... (interval of 50020)

Is it possible to change this to

1000, 2000, 3000, ... (interval of 1000)

without changing the maximum number of iterations (500200)?
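Judging from the numbers in this thread (max_batches = 500200 giving a tick interval of 50020), the chart appears to split the x-axis from 0 to max_batches into ten equal ticks, which would mean the interval cannot be changed independently of max_batches. A quick sanity check of that assumption:

```python
def chart_tick_interval(max_batches, n_ticks=10):
    # Assumption: the loss chart divides the x-axis (0..max_batches)
    # into a fixed number of equal ticks.
    return max_batches // n_ticks

# Matches the interval reported above for max_batches = 500200.
print(chart_tick_interval(500200))   # 50020

# An interval of 1000 would therefore require max_batches = 10000.
print(chart_tick_interval(10000))    # 1000
```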

prateekgupta891 commented 5 years ago

Hi @AlexeyAB, while retraining with "-map", the avg-loss plot window is not being generated. Does it get saved somewhere from where I can open it and view it? (I think I have issues with OpenCV.)

7thstorm commented 4 years ago

I have a large dataset with 4 classes (around 25M images). I am using a 2080 Ti, and after approximately 3 weeks I'm hovering between 0.1550 and 0.1754 average loss after 300K iterations. I've seen it go down to 0.14 at times, but this is where it seems stuck (width and height are 416, max_batches is 26000000, and batch is 256). With YOLOv3, can I expect the avg loss to go down any further?