Loss diverges to very high values in Object Detection API #4849

Closed Rayndell closed 4 years ago

Rayndell commented 6 years ago

Describe the problem

While training a detection model with the Object Detection API (RFCN with a ResNet v1 101 feature extractor), I encountered what looks like a bug (maybe an overflow). After training for a little while without problems on a custom dataset, the loss begins to take very large values that grow very quickly. At first I thought the loss value was overflowing, but the growth seems more progressive:

INFO:tensorflow:global step 69: loss = 0.5207 (0.422 sec/step)
INFO:tensorflow:global step 70: loss = 0.2422 (0.406 sec/step)
INFO:tensorflow:global step 71: loss = 0.5621 (0.391 sec/step)
INFO:tensorflow:global step 72: loss = 1.5380 (0.422 sec/step)
INFO:tensorflow:global step 73: loss = 0.3814 (0.422 sec/step)
INFO:tensorflow:global step 74: loss = 0.6671 (0.406 sec/step)
INFO:tensorflow:global step 75: loss = 1.8687 (0.422 sec/step)
INFO:tensorflow:global step 76: loss = 1.4388 (0.438 sec/step)
INFO:tensorflow:global step 77: loss = 5.7769 (0.391 sec/step)
INFO:tensorflow:global step 78: loss = 15.7012 (0.391 sec/step)
INFO:tensorflow:global step 79: loss = 18.4020 (0.391 sec/step)
INFO:tensorflow:global step 80: loss = 0.1409 (0.422 sec/step)
INFO:tensorflow:global step 81: loss = 41.4890 (0.391 sec/step)
INFO:tensorflow:global step 82: loss = 391.0323 (0.391 sec/step)
INFO:tensorflow:global step 83: loss = 1184.6780 (0.391 sec/step)
INFO:tensorflow:global step 84: loss = 113833.9297 (0.406 sec/step)
INFO:tensorflow:global step 85: loss = 1028554.8125 (0.422 sec/step)
INFO:tensorflow:global step 86: loss = 2339703.2500 (0.406 sec/step)
INFO:tensorflow:global step 87: loss = 4837331.0000 (0.406 sec/step)
INFO:tensorflow:global step 88: loss = 120685024.0000 (0.406 sec/step)
INFO:tensorflow:global step 89: loss = 3053880832.0000 (0.375 sec/step)
INFO:tensorflow:global step 90: loss = 74913587200.0000 (0.406 sec/step)
INFO:tensorflow:global step 91: loss = 299197333504.0000 (0.406 sec/step)
INFO:tensorflow:global step 92: loss = 22885305417728.0000 (0.406 sec/step)
INFO:tensorflow:global step 93: loss = 710914261123072.0000 (0.406 sec/step)

See the attached log file for more details. I was not able to locate any memory allocation problems during the execution. I also attached the custom code I used to transform my data into a TFRecord file (adapted from the create_pascal_tf_record.py script), as well as the label map and the config I used. The strange thing is that when I do exactly the same thing with a label map containing only 1 label, the training works fine and converges. I know there are some labeling mistakes on some of the bounding boxes, but they are not that frequent, and if the problem were caused by, say, too little data for some classes, wouldn't the training simply fail to converge rather than produce such extreme values? Do you have any idea what could cause this behavior? Thanks a lot for your help.
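For reference, label maps for the Object Detection API use the protobuf text format; a minimal multi-class sketch (with hypothetical class names, not my actual ones) looks like this:

item {
  id: 1
  name: 'car'
}
item {
  id: 2
  name: 'pedestrian'
}

Ids start at 1 (id 0 is reserved for the background class), and the number of items is expected to match num_classes in the pipeline config.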

Source code / logs

object_detection.zip

Rayndell commented 6 years ago

Hi, I am updating this issue since I still haven't found the cause of this behavior. I carefully double-checked my data and there are no more class errors. The thing is, when I assign the same class to every object, everything works fine: the training converges and I obtain accurate detection results. When I assign the real class to each object, the loss diverges very quickly. Do you have any insight?

ZOUHEIRBN commented 4 years ago

Although I'm working with one class, loss values converge towards 6. They converge towards 0.4 when using Faster RCNN, though...

Lill98 commented 4 years ago

Have you solved it yet?

Rayndell commented 4 years ago

I'm closing this issue. The problem was on my end: I forgot to change the number of classes in the .config file, which was set to 1 all along.
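For anyone landing here with the same symptom: in the Object Detection API pipeline config, RFCN models are configured through the faster_rcnn block, and num_classes sits near its top. A minimal sketch of the fix (the class count here is illustrative; set it to the number of items in your label map):

model {
  faster_rcnn {
    num_classes: 3  # was 1; must match the number of classes in the label map
    # ... rest of the model config unchanged
  }
}

With num_classes smaller than the actual number of classes in the data, the classification targets no longer line up with the model's logits, which plausibly explains the runaway loss seen above.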