githubharald / SimpleHTR

Handwritten Text Recognition (HTR) system implemented with TensorFlow.
https://towardsdatascience.com/2326a3487cd5
MIT License
1.99k stars 893 forks

Error in model training #53

Closed ritzyag closed 5 years ago

ritzyag commented 5 years ago

Hi,

I have trained the model from scratch for 100 epochs. Now, when I retrain the model, initializing the weights from the earlier model (trained for 100 epochs), it seems to start training from scratch again: the loss is very high and the accuracy very low during retraining. There is also a weird message that gets displayed: "2019-03-02 12:39:12.753314: tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr"

I would be very grateful if you could help me through it.

Below are the logs. The first part shows the loss and accuracy on the train set in the 100th epoch. Then I restart training, initializing the weights from the previous model, but the loss is very high and that weird message is displayed. :(

Thank You!

Epoch: 100
Train NN
Batch: 1 / 18 Loss: 0.32776487
Batch: 2 / 18 Loss: 0.60198474
Batch: 3 / 18 Loss: 0.48517135
Batch: 4 / 18 Loss: 0.27418187
Batch: 5 / 18 Loss: 0.55649453
Batch: 6 / 18 Loss: 0.26786774
Batch: 7 / 18 Loss: 0.34285638
Batch: 8 / 18 Loss: 0.20939198
Batch: 9 / 18 Loss: 0.27928185
Batch: 10 / 18 Loss: 0.6662815
Batch: 11 / 18 Loss: 0.5485757
Batch: 12 / 18 Loss: 0.5808813
Batch: 13 / 18 Loss: 0.9294588
Batch: 14 / 18 Loss: 0.86137664
Batch: 15 / 18 Loss: 0.50504154
Batch: 16 / 18 Loss: 0.68977255
Batch: 17 / 18 Loss: 0.9456356
Batch: 18 / 18 Loss: 0.41855794
Character train error rate: 1.444444%. Word train accuracy: 92.111111%.
Validate NN
Batch: 1 / 8
Batch: 2 / 8
Batch: 3 / 8
Batch: 4 / 8
Batch: 5 / 8
Batch: 6 / 8
Batch: 7 / 8
Batch: 8 / 8
Character dev error rate: 13.708333%. Word dev accuracy: 51.750000%.
Character error rate not improved
(SimpleHTR) shipsy@shipsy-pc:~/Ritika/Text_Recognition/SimpleHTR/src$ python main1.py --train
Python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
Tensorflow: 1.12.0
2019-03-02 12:39:07.470070: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-02 12:39:07.474119: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Init with stored values from ../model/snapshot-34
Epoch: 1
Train NN
Batch: 1 / 18 Loss: 0.27486247
Batch: 2 / 18 Loss: 12.9338875
Batch: 3 / 18 Loss: 59.33571
2019-03-02 12:39:12.753314: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:12.753377: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 4 / 18 Loss: 59.97041
Batch: 5 / 18 Loss: 37.964706
Batch: 6 / 18 Loss: 58.27094
Batch: 7 / 18 Loss: 35.533077
Batch: 8 / 18 Loss: 33.15117
Batch: 9 / 18 Loss: 25.096054
2019-03-02 12:39:18.523761: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:18.523794: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 10 / 18 Loss: 14.908901
2019-03-02 12:39:19.445465: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:19.445845: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 11 / 18 Loss: 14.322981
2019-03-02 12:39:20.312253: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:20.312287: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 12 / 18 Loss: 13.987829
2019-03-02 12:39:21.205180: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:21.205211: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 13 / 18 Loss: 14.24977
2019-03-02 12:39:21.989036: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:21.989381: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 14 / 18 Loss: 14.085554
2019-03-02 12:39:22.900757: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:22.901134: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 15 / 18 Loss: 13.849102
2019-03-02 12:39:23.766655: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:23.766704: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 16 / 18 Loss: 13.1134615
2019-03-02 12:39:24.642604: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:24.642637: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 17 / 18 Loss: 13.719116
2019-03-02 12:39:25.501310: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:25.501352: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 18 / 18 Loss: 14.152748
2019-03-02 12:39:26.263092: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:26.263124: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Character train error rate: 93.259259%. Word train accuracy: 0.111111%.
Validate NN
Batch: 1 / 8
2019-03-02 12:39:26.595233: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:26.595940: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 2 / 8
2019-03-02 12:39:27.060717: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:27.060748: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 3 / 8
2019-03-02 12:39:27.378411: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:27.378444: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 4 / 8
2019-03-02 12:39:27.926617: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:27.926659: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 5 / 8
2019-03-02 12:39:28.494950: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:28.494982: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 6 / 8
2019-03-02 12:39:28.825010: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:28.825041: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 7 / 8
2019-03-02 12:39:29.049452: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:29.049822: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Batch: 8 / 8
2019-03-02 12:39:29.292437: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
2019-03-02 12:39:29.292787: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
Character dev error rate: 100.000000%. Word dev accuracy: 0.000000%.
Character error rate improved, save model

The logs are not complete. This message "2019-03-02 12:39:29.292787: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr" stops showing after training for more epochs. Also, it is displayed every time, whether I train from scratch or initialize the weights from a previous model.

ritzyag commented 5 years ago

I could get rid of the TensorFlow error message "2019-03-02 12:39:29.292787: E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr" by creating a new virtual env and installing the packages from scratch.

But retraining the model from a checkpoint still gives a very high loss and very low accuracy, and I do not understand why. Once the weights have been restored from the checkpoint, training should continue from that point and the loss should keep decreasing (on the train data). Instead, the loss shoots out of range in the second batch of the first epoch itself. Please resolve this bug.
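
For reference, the restore step itself looks roughly like the standard TF 1.x Saver pattern below (a sketch, not the exact SimpleHTR code; the dummy `weights` variable just stands in for the model's variables). It only brings back the graph variables stored in the snapshot, not anything about the training schedule:

```python
import tensorflow as tf  # TF 1.x, matching the 1.12.0 shown in the logs above

# Hypothetical stand-in for the model's weights; in SimpleHTR these would be
# the CNN/RNN/CTC variables built by the Model class.
weights = tf.get_variable('weights', shape=[3, 3], initializer=tf.zeros_initializer())

saver = tf.train.Saver(max_to_keep=1)
with tf.Session() as sess:
    latest = tf.train.latest_checkpoint('../model/')   # e.g. ../model/snapshot-34
    if latest:
        saver.restore(sess, latest)                     # only saved graph variables are restored
    else:
        sess.run(tf.global_variables_initializer())     # no snapshot -> start from scratch
```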

Training again from scratch every time is very tedious.

Thank You!

githubharald commented 5 years ago

Set the learning rate to a fixed number, e.g. 0.0001: https://github.com/githubharald/SimpleHTR/blob/master/src/Model.py#L212
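
The rate at that line is not a constant but a schedule keyed to how many batches the current run has trained, and that counter restarts at zero whenever the training script is relaunched, so a restored snapshot briefly gets updated with the large initial rate again. A minimal sketch of the idea (the thresholds and values are quoted from memory and may not match the current source exactly):

```python
def decayed_rate(batches_trained: int) -> float:
    # Approximation of the schedule around Model.py#L212: the counter is a plain
    # Python attribute that resets to 0 on every restart of training, so the first
    # updates after restoring a snapshot are taken with the large 0.01 rate.
    if batches_trained < 10:
        return 0.01
    elif batches_trained < 10000:
        return 0.001
    return 0.0001


def fixed_rate(batches_trained: int) -> float:
    # Suggested replacement when continuing from a checkpoint: one small,
    # constant learning rate, independent of the batch counter.
    return 0.0001
```

With the fixed rate, the restored weights should only be nudged gently instead of being overwritten by a few large steps.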

ritzyag commented 5 years ago

Firstly, I would like to extend my heartfelt thanks to you for being so considerate and taking the time to reply to our issues. It really means a lot! 💯

Now, coming back to what you suggested: I believe that might help. But if that were the issue, then the loss in the first batch should also be very high. Yet when retraining starts, the loss is very low for the first batch. Why? The learning rate is the same for the first ten batches, so the losses during those batches should also be comparable. Am I missing something?