thtrieu / darkflow

Translate darknet to tensorflow. Load trained weights, retrain/fine-tune using tensorflow, export constant graph def to mobile devices
GNU General Public License v3.0

Training stuck at 8% #1159

Open Sequential-circuits opened 4 years ago

Sequential-circuits commented 4 years ago

We are training a model with 83 thousand pictures (2592x1944) of 200 different supermarket products in 200 classes.

To simplify, we created a single class for all products and put that single class in all the XML files.

We are training it on a Tesla V100 with 32 GB using the command `flow --model /root/convert/meu.cfg --train --annotation /root/convert/train --dataset /root/retail/images --gpu 0.9 --batch 50 --lr 0.01 --trainer adam`.

The cfg file has `width=832`, `height=832`, and `filters=30`.
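(For reference, `filters=30` matches the rule from darkflow's README for custom classes, filters = num * (classes + 5): with our single merged class and the usual 5 anchors, 5 * (1 + 5) = 30. The tail of our cfg looks roughly like this, keeping the anchors line from the base cfg:)

```
# Last layers of a YOLOv2-style cfg for a single class:
# filters = num * (classes + 5) = 5 * (1 + 5) = 30
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[region]
# keep the anchors= line from the base cfg
bias_match=1
classes=1
coords=4
num=5
```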

No matter whether we change the learning rate or the number of pictures per batch, the training loss gets stuck at around 8% and will not go down from there.

Any suggestions on how we can get past this snag? Thank you.

mfaramarzi commented 4 years ago

@Sequential-circuits I tried to fix this problem by varying the learning rate periodically (as explained in the link below), especially when the loss stays unchanged for a while. Also, sometimes you just need to be patient: more iterations can get it unstuck.

https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
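If it keeps plateauing, you can also stop training and resume from the most recent checkpoint with a smaller learning rate. A rough sketch using the flags from your own command (`--load -1` loads the latest checkpoint from the `ckpt/` folder; the 10x lower `--lr` is just a starting point to try):

```sh
# Resume from the latest checkpoint (--load -1) with a 10x smaller learning rate.
# Paths and batch size are the ones from the command earlier in this thread.
flow --model /root/convert/meu.cfg --train \
     --annotation /root/convert/train --dataset /root/retail/images \
     --load -1 --gpu 0.9 --batch 50 --lr 0.001 --trainer adam
```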

Sequential-circuits commented 4 years ago

Thanks, you are right, and that's actually what we did: we just let it run for a few days and now we are down to 1.45%. We froze the model and it seems to work, and we are letting it run some more. My question now is: what loss do people usually consider low enough to call the model converged?
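For anyone hitting the same issue, this is roughly how we froze it (a sketch; `--savepb` exports the constant graph def to `built_graph/`, and `--load -1` picks up the latest checkpoint):

```sh
# Export the trained model as a constant graph def (written to built_graph/),
# loading the latest checkpoint with --load -1.
flow --model /root/convert/meu.cfg --load -1 --savepb
```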

mfaramarzi commented 4 years ago

@Sequential-circuits It is supposed to get very close to zero (below 0.1). You can run validation after some iterations to check how your model performs, even before the loss gets close to zero. Sometimes it gives good detections even at relatively high loss values.
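A quick way to spot-check is to run detection on a held-out folder of images and inspect the results; the folder path below is just a placeholder for your own validation images:

```sh
# Run detection on a folder of validation images and dump JSON results alongside them.
# /root/retail/val_images is a hypothetical path; point it at your own held-out images.
flow --imgdir /root/retail/val_images --model /root/convert/meu.cfg --load -1 --gpu 0.9 --json
```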

Sequential-circuits commented 4 years ago

Thank you: we let it run for a while longer and the loss would ping-pong between 1 and 2, so since it seems to work well, we consider it trained.