Loss diverges to very high values in Object Detection API #4849

Closed Rayndell closed 4 years ago

Rayndell commented 6 years ago

Describe the problem

While training a detection model with the Object Detection API (RFCN with a ResNet v1 101 feature extractor), I encountered what looks like a bug (maybe an overflow). After training for a little while without problems on a custom dataset, the loss begins to take very large values that grow very quickly. At first I thought the loss value was overflowing, but the growth seems more progressive:

INFO:tensorflow:global step 69: loss = 0.5207 (0.422 sec/step)
INFO:tensorflow:global step 70: loss = 0.2422 (0.406 sec/step)
INFO:tensorflow:global step 71: loss = 0.5621 (0.391 sec/step)
INFO:tensorflow:global step 72: loss = 1.5380 (0.422 sec/step)
INFO:tensorflow:global step 73: loss = 0.3814 (0.422 sec/step)
INFO:tensorflow:global step 74: loss = 0.6671 (0.406 sec/step)
INFO:tensorflow:global step 75: loss = 1.8687 (0.422 sec/step)
INFO:tensorflow:global step 76: loss = 1.4388 (0.438 sec/step)
INFO:tensorflow:global step 77: loss = 5.7769 (0.391 sec/step)
INFO:tensorflow:global step 78: loss = 15.7012 (0.391 sec/step)
INFO:tensorflow:global step 79: loss = 18.4020 (0.391 sec/step)
INFO:tensorflow:global step 80: loss = 0.1409 (0.422 sec/step)
INFO:tensorflow:global step 81: loss = 41.4890 (0.391 sec/step)
INFO:tensorflow:global step 82: loss = 391.0323 (0.391 sec/step)
INFO:tensorflow:global step 83: loss = 1184.6780 (0.391 sec/step)
INFO:tensorflow:global step 84: loss = 113833.9297 (0.406 sec/step)
INFO:tensorflow:global step 85: loss = 1028554.8125 (0.422 sec/step)
INFO:tensorflow:global step 86: loss = 2339703.2500 (0.406 sec/step)
INFO:tensorflow:global step 87: loss = 4837331.0000 (0.406 sec/step)
INFO:tensorflow:global step 88: loss = 120685024.0000 (0.406 sec/step)
INFO:tensorflow:global step 89: loss = 3053880832.0000 (0.375 sec/step)
INFO:tensorflow:global step 90: loss = 74913587200.0000 (0.406 sec/step)
INFO:tensorflow:global step 91: loss = 299197333504.0000 (0.406 sec/step)
INFO:tensorflow:global step 92: loss = 22885305417728.0000 (0.406 sec/step)
INFO:tensorflow:global step 93: loss = 710914261123072.0000 (0.406 sec/step)

See the attached log file for more details. I was not able to locate any memory allocation problems during the execution. I also attached the custom code I used to transform my data into a TFRecord file (adapted from the create_pascal_tf_record.py script), as well as the label map and the config I used. The strange thing is that when I do exactly the same thing with a label map containing only 1 label, the training works fine and converges. I know there are some labeling mistakes on some of the bounding boxes, but they are not that frequent, and if the problem were caused by, say, too little data for some classes, wouldn't the training simply fail to converge rather than produce such extreme values? Do you have any idea what could cause this behavior? Thanks a lot for your help.
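For reference, label maps for the Object Detection API use the protobuf text format; a minimal multi-class sketch (with hypothetical class names, not my actual ones) looks like this:

item {
  id: 1
  name: 'car'
}
item {
  id: 2
  name: 'pedestrian'
}

Ids start at 1 (id 0 is reserved for the background class), and the number of items is expected to match num_classes in the pipeline config.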

Source code / logs

object_detection.zip

Rayndell commented 6 years ago

Hi, I am updating this issue since I still haven't found the cause of this behavior. I carefully double-checked my data and there are no more class errors. The thing is, when I assign the same class to every object, everything works fine: the training converges and I obtain accurate detection results. When I assign the real class to each object, the loss diverges very quickly. Do you have any insight?

ZOUHEIRBN commented 4 years ago

Although I'm working with one class, loss values converge towards 6. They converge towards 0.4 when using Faster RCNN, though...

Lill98 commented 4 years ago

Have you solved it yet?

Rayndell commented 4 years ago

I'm closing this issue. The problem was on my end: I forgot to change the number of classes in the .config file, which was set to 1 all along.
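For anyone landing here with the same symptom: in the Object Detection API pipeline config, RFCN models are configured through the faster_rcnn block, and num_classes sits near its top. A minimal sketch of the fix (the class count here is illustrative; set it to the number of items in your label map):

model {
  faster_rcnn {
    num_classes: 3  # was 1; must match the number of classes in the label map
    # ... rest of the model config unchanged
  }
}

With num_classes smaller than the actual number of classes in the data, the classification targets no longer line up with the model's logits, which plausibly explains the runaway loss seen above.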