tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

Error while training (using frcnn) on new data set #245

Closed annusrcm closed 5 years ago

annusrcm commented 5 years ago

I used the frcnn algorithm to detect tables in document images. Training finished and worked well on the UNLV dataset; however, when I use the new dataset it runs for a few images and then gives the following error:

Traceback (most recent call last):
  File "/iq_storage/virtualenv/table_detection/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/iq_storage/virtualenv/table_detection/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/iq_storage/virtualenv/table_detection/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: losses/RPNLoss/background_cls_loss_1
  [[Node: losses/RPNLoss/background_cls_loss_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](losses/RPNLoss/background_cls_loss_1/tag, losses/RPNLoss/boolean_mask_5/GatherV2)]]

Note: Using the sample code in this blog.

nagitsu commented 5 years ago

It seems like the RPN loss is exploding. Can you show the full output of the command? Are the loss values increasing until they reach a NaN?

If that's the case, you can try lowering the learning rate a bit. Can you show us the config file you're using?

annusrcm commented 5 years ago

Hi @nagitsu

folder5_training_fail The loss values are extremely large; check steps 26, 27 and 28 in the attached image.

The contents of the config.yml: config.txt

nagitsu commented 5 years ago

The learning rate is definitely too high; try reducing it a bit and check whether the loss stops exploding. You can read more on how to do this in the tutorial from the docs.
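
For example, you could override the default rate in your config.yml with something roughly like this (a minimal sketch, assuming the learning_rate block nests under train as in base_config.yml; the 0.0001 value is only an illustration, and the decay-related keys documented there can be kept or adjusted):

train:
  learning_rate:
    learning_rate: 0.0001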

Let me know how it goes :)

annusrcm commented 5 years ago

Hi @nagitsu I re-ran the training (with everything else the same) to reproduce the very large training loss, but this time the loss is not exploding: in the first epoch the training has reached about 500 steps and the loss is still below 10. In my config file I had not defined any learning rate, so I assume the library is using the default learning rate of 0.0003 as defined in base_config.yml. I am expecting it to explode at some point.

Now I have reduced the learning rate to 0.0001 with exponential decay and will post an update once the training finishes.

annusrcm commented 5 years ago

Hi @nagitsu With the following configuration for the learning rate:

decay_method: exponential_decay
decay_rate: 0.5
decay_steps: 5000
learning_rate: 0.0003

the problem of the exploding RPN loss vanished. However, I am getting another error: Reduction axis 0 is empty in shape [0,2]. Shall I raise another issue or post the full error traceback here?
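
For reference, assuming this maps to TensorFlow's standard exponential_decay formula, lr(step) = learning_rate * decay_rate ** (step / decay_steps), the schedule above halves the rate every 5000 steps:

# lr(step) = 0.0003 * 0.5 ** (step / 5000)
# step 0     -> 0.0003
# step 5000  -> 0.00015
# step 10000 -> 0.000075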

nagitsu commented 5 years ago

The error seems unrelated; please open a new issue with the full traceback :)

By the way, you might want to try a less abrupt decay, such as this:

learning_rate:
  decay_method: piecewise_constant
  boundaries: [250000, 450000, 600000]
  values: [0.0003, 0.0001, 0.00003, 0.00001]
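  # roughly: steps up to 250k use 0.0003, 250k-450k use 0.0001,
  # 450k-600k use 0.00003, and anything beyond 600k uses 0.00001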

The boundaries themselves would be different, depending on when the loss stabilizes (you can look at the TensorBoard charts for the losses).