tesseract-ocr / tesstrain

Train Tesseract LSTM with make

question: How to Diagnose Overfitting and Underfitting of Tesseract Models? #200

Open Shreeshrii opened 3 years ago

Shreeshrii commented 3 years ago

Is there a way to diagnose Overfitting and Underfitting of Tesseract Models?

@stweil had suggested in a different thread that one should evaluate all models to find the best fit for the eval data.

Is there a way to extract from the checkpoints the number of iterations and the training CER, together with the CER from lstmeval, in order to find the best model and graph the results, similar to the output shown in this article for Keras?

wrznr commented 3 years ago

@Shreeshrii Not sure about lstmeval, but the file names of the checkpoints themselves encode the training CER and the number of iterations:

$ ls -lrt
total 891036
-rw-rw-r-- 1 kmw kmw 34427398 Sep  2 13:13 boernerianus_checkpoint
-rw-rw-r-- 1 kmw kmw 17212641 Sep  2 13:13 boernerianus_71.96_9294_9300.checkpoint
-rw-rw-r-- 1 kmw kmw 17212665 Sep  2 13:13 boernerianus_68.251_9493_9500.checkpoint
-rw-rw-r-- 1 kmw kmw 17212689 Sep  2 13:13 boernerianus_64.962_9689_9700.checkpoint
-rw-rw-r-- 1 kmw kmw 17212713 Sep  2 13:13 boernerianus_62.498_9889_9900.checkpoint
-rw-rw-r-- 1 kmw kmw 17212737 Sep  2 13:13 boernerianus_61.812_10087_10100.checkpoint
-rw-rw-r-- 1 kmw kmw 17212749 Sep  2 13:13 boernerianus_59.662_10185_10200.checkpoint
-rw-rw-r-- 1 kmw kmw 17212785 Sep  2 13:13 boernerianus_57.284_10471_10500.checkpoint
-rw-rw-r-- 1 kmw kmw 17212821 Sep  2 13:13 boernerianus_54.814_10761_10800.checkpoint
-rw-rw-r-- 1 kmw kmw 17212857 Sep  2 13:13 boernerianus_52.799_11221_11300.checkpoint
-rw-rw-r-- 1 kmw kmw 17212917 Sep  2 13:13 boernerianus_49.958_11685_11800.checkpoint
-rw-rw-r-- 1 kmw kmw 17212941 Sep  2 13:13 boernerianus_45.862_11873_12000.checkpoint
-rw-rw-r-- 1 kmw kmw 17212953 Sep  2 13:13 boernerianus_44.034_11965_12100.checkpoint
-rw-rw-r-- 1 kmw kmw 17212965 Sep  2 13:13 boernerianus_41.24_12054_12200.checkpoint
-rw-rw-r-- 1 kmw kmw 17212977 Sep  2 13:13 boernerianus_38.882_12145_12300.checkpoint
-rw-rw-r-- 1 kmw kmw 17212989 Sep  2 13:13 boernerianus_36.6_12242_12400.checkpoint
-rw-rw-r-- 1 kmw kmw 17213001 Sep  2 13:13 boernerianus_32.697_12328_12500.checkpoint
-rw-rw-r-- 1 kmw kmw 17213013 Sep  2 13:13 boernerianus_30.751_12416_12600.checkpoint
-rw-rw-r-- 1 kmw kmw 17213025 Sep  2 13:13 boernerianus_28.325_12509_12700.checkpoint
-rw-rw-r-- 1 kmw kmw 17213037 Sep  2 13:13 boernerianus_26.503_12602_12800.checkpoint
-rw-rw-r-- 1 kmw kmw 17213049 Sep  2 13:13 boernerianus_24.218_12681_12900.checkpoint
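
Since the names follow the pattern <model>_<training CER>_<iterations>_<iterations>.checkpoint (the two counters appear to match the learning and training iterations reported in the lstmtraining log), the encoded values can be pulled out for plotting with a one-liner, e.g. (a rough sketch based on the file names above; checkpoints.tsv is just an illustrative output name):

ls boernerianus_*.checkpoint | \
    sed -E 's/^boernerianus_([0-9.]+)_([0-9]+)_([0-9]+)\.checkpoint$/\3\t\2\t\1/' | \
    sort -n > checkpoints.tsv
# columns: training iterations, learning iterations, training CER (%)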

You can always “finish” a checkpoint and convert it into a working Tesseract model via lstmtraining --stop_training. In doing so, you can create a set of models which can then be tested against an evaluation set, but which may also be used for a Calamari-like combined application of different training stages at once.
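
For example, converting a single checkpoint could look like this (a minimal sketch; all paths are illustrative):

lstmtraining --stop_training \
    --continue_from data/checkpoints/boernerianus_24.218_12681_12900.checkpoint \
    --traineddata data/boernerianus/boernerianus.traineddata \
    --model_output data/boernerianus_24.218_12681_12900.traineddata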

Shreeshrii commented 3 years ago

you can create a set of models which can then be tested against an evaluation set

I have recently tried splitting the training data into three sets, e.g. using 80% for training, keeping 10% for eval during training, and reserving 10% as a validation set for testing the traineddata files built from the checkpoints.
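
A rough sketch of such an 80/10/10 split, assuming the training lines are referenced from a single list file (all file names here are illustrative):

shuf all-lines.txt > shuffled.txt
total=$(wc -l < shuffled.txt)
n_train=$((total * 80 / 100))
n_eval=$((total * 10 / 100))
head -n "$n_train" shuffled.txt > list.train                              # 80% for training
tail -n +"$((n_train + 1))" shuffled.txt | head -n "$n_eval" > list.eval  # next 10% for eval during training
tail -n +"$((n_train + n_eval + 1))" shuffled.txt > list.test             # remaining 10% for validation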

I find that there are differences between the training CER for a checkpoint and the validation-set CER for the traineddata built from the same checkpoint. The CER from the eval set used during training is not easily available.

Earlier I thought that the model with the lowest training CER was the best. However, after running tests with the validation set, that does not seem to be true, because that model might be overfitted to the training set.

Hence my question ...

Shreeshrii commented 3 years ago

may also be used for a Calamari-like combined application of different training stages at once.

I am not familiar with this. Please elaborate or provide a link with further info.

Shreeshrii commented 3 years ago

Here are my results from a validation run:

At iteration 0, stage 0, Eval Char error rate=3.5101527, Word error rate=9.2906778 ***** iast_0.706_419620_1359700.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.5149699, Word error rate=9.0415867 ***** iast_0.576_444984_1500100.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.5472042, Word error rate=9.4882442 ***** iast_0.879_361194_1071400.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.5526628, Word error rate=9.5266959 ***** iast_0.922_361153_1071200.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6000142, Word error rate=9.2806764 ***** iast_0.598_438064_1461300.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6119001, Word error rate=9.6337073 ***** iast_0.817_366009_1094100.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6328469, Word error rate=9.6109486 ***** iast_0.651_419705_1360200.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6358543, Word error rate=9.7243117 ***** iast_1.113_296390_804400.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6379109, Word error rate=9.5886973 ***** iast_1.063_297779_809900.checkpoint-fast.traineddata **** gt/list.test 
At iteration 0, stage 0, Eval Char error rate=3.6440435, Word error rate=9.4520018 ***** iast_0.786_383730_1177900.checkpoint-fast.traineddata **** gt/list.test 
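
A hedged sketch of how such a ranking can be produced (the paths and the glob pattern are illustrative): finish each checkpoint into a traineddata file, run lstmeval with it against the held-out list, and sort by the reported character error rate.

for model in data/iast_*.checkpoint-fast.traineddata; do
    result=$(lstmeval --model "$model" --eval_listfile gt/list.test 2>&1 | grep 'Eval Char')
    echo "$result ***** $(basename "$model") **** gt/list.test"
done | sort -t= -k2 -n
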
Shreeshrii commented 3 years ago

The CER from the eval set used during training is not easily available.

I was able to extract it from the training log file.


At iteration 13291/15800/15800, Mean rms=0.696%, delta=2.443%, char train=8.304%, word train=21.54%, skip ratio=0%,  New worst char error = 8.304
At iteration 7443, stage 0, Eval Char error rate=10.111046, Word error rate=26.474236 wrote checkpoint.

At iteration 16083/19800/19800, Mean rms=0.661%, delta=2.288%, char train=7.113%, word train=19.995%, skip ratio=0%,  New worst char error = 7.113
At iteration 12273, stage 1, Eval Char error rate=7.6912357, Word error rate=21.70712 wrote checkpoint.

At iteration 19430/24800/24800, Mean rms=0.645%, delta=2.155%, char train=6.618%, word train=19.316%, skip ratio=0%,  New worst char error = 6.618
At iteration 15033, stage 1, Eval Char error rate=6.892947, Word error rate=20.06269 wrote checkpoint.

At iteration 21959/28800/28800, Mean rms=0.603%, delta=1.953%, char train=6.338%, word train=19.004%, skip ratio=0%,  New worst char error = 6.338
At iteration 18400, stage 1, Eval Char error rate=6.493667, Word error rate=18.827539 wrote checkpoint.

At iteration 25842/35100/35100, Mean rms=0.6%, delta=2.055%, char train=6.291%, word train=17.961%, skip ratio=0%,  New worst char error = 6.291
At iteration 20948, stage 1, Eval Char error rate=6.2389766, Word error rate=18.548224 wrote checkpoint.

At iteration 28482/39500/39500, Mean rms=0.573%, delta=1.864%, char train=5.943%, word train=17.403%, skip ratio=0%,  New worst char error = 5.943
At iteration 24784, stage 1, Eval Char error rate=5.6843931, Word error rate=16.798239 wrote checkpoint.

At iteration 31742/45000/45000, Mean rms=0.56%, delta=1.862%, char train=6.286%, word train=16.177%, skip ratio=0%,  New worst char error = 6.286
At iteration 27814, stage 1, Eval Char error rate=5.2523296, Word error rate=15.64968 wrote checkpoint.

At iteration 34953/50700/50700, Mean rms=0.531%, delta=1.617%, char train=4.942%, word train=14.97%, skip ratio=0%,  New worst char error = 4.942
At iteration 30103, stage 1, Eval Char error rate=5.283013, Word error rate=16.057397 wrote checkpoint.

At iteration 37589/55500/55500, Mean rms=0.496%, delta=1.382%, char train=4.285%, word train=14.018%, skip ratio=0%,  New best char error = 4.285
At iteration 33926, stage 1, Eval Char error rate=5.0264877, Word error rate=15.354675 wrote checkpoint.

The eval iterations lag behind the training iterations.

Shreeshrii commented 3 years ago

I was able to pull out the training CER and eval CER data from the lstmtraining log and then plot it. Validation run CER is not included here.

Here is the script to extract the data and plot it.

# Pull the eval error-rate lines out of the training log.
# (Note: the greedy ',.*=' pattern keeps the value after the last '=' on the
# line, i.e. the word error rate -- see the EDIT below.)
grep 'Eval Char' /home/ubuntu/tess5training-iast/LAYER.log | sed -e 's/^.*[0-9]At iteration //' | sed -e 's/,.*=/\t/' | sed -e 's/ wrote.*$//' | sed -e 's/^/\t\t\t/' > plot-eval.txt
# Pull the best-model checkpoint names; CER and iterations are encoded in the name.
grep 'best model' /home/ubuntu/tess5training-iast/LAYER.log | sed -e 's/^.*\///' | sed -e 's/\.checkpoint.*$//' | sed -e 's/_/\t/g' > plot-best.txt
# Pull the per-iteration training CER.
grep 'At iteration' /home/ubuntu/tess5training-iast/LAYER.log | sed -e '/^Sub/d' | sed -e '/^Update/d' | sed 's/, Mean.*char train=/\t\t/g' | sed -e 's/%, word.*$//' | sed -e 's/At iteration.*\//\t\t\t/g' > plot-iteration.txt
# Prepend a column-header line and plot.
cat plot-header.txt plot-best.txt plot-eval.txt plot-iteration.txt > plot.csv
python plot.py

plot.py

import pandas as pd
import matplotlib.pyplot as plt

# plot.csv is tab separated; the column names come from plot-header.txt.
dataframe = pd.read_csv("plot.csv", sep='\t', encoding='utf-8')
x = dataframe.TrainingIteration
y = dataframe.TrainingCER    # training CER at the best-model checkpoints
z = dataframe.EvalCER        # evaluation CER reported during lstmtraining
w = dataframe.IterationCER   # per-iteration training CER
plt.title('tess5training-iast - Training and Evaluation CER')
plt.xlabel('Iterations')
plt.ylabel('Character Error Rate')
plt.plot(x, y, 'b', label='Training CER at best model checkpoints')
plt.scatter(x, w, s=1, c='magenta', label='Training CER')
plt.scatter(x, y, c='blue', s=10, label='Best model checkpoint CER')
plt.plot(x, z, 'r', label='Evaluation CER during lstmtraining')
plt.scatter(x, z, c='red', s=10, label='Evaluation CER')
plt.legend()
# Dotted vertical marker at a hardcoded iteration value (specific to this run).
xcoords = [150000]
for xc in xcoords:
    plt.axvline(x=xc, color='k', linestyle='dotted', ymin=0.0, ymax=0.25)
plt.savefig("plot.png")

[plot image]

EDIT: This incorrectly uses WER for one of the data columns, so the plot is not accurate.

Looks like my problem might be that I am using too many lines of training data (about 150,000+) and then killing the training process without allowing enough epochs.

EDIT: Plot with correct CER data

[plot image]

wrznr commented 3 years ago

Wow, that's a great tool. Could you add it to the repo so that we can think about utilizing it in the Makefile?

Looks like my problem might be using too many lines of training data

The difference between test and eval is remarkable and could be an indicator that the evaluation set contains lines which are very different from what is seen during training. Why is the number of data points for Evaluation CER so small?

wrznr commented 3 years ago

I am not familiar with this. Please elaborate or provide a link with further info.

Calamari writes a number of models during its training process (not only the best one, as Tesseract does). In their papers, @chreul and colleagues propose a strategy called voting -- which is more or less a majority decision over the outputs of multiple models during the recognition stage -- and use all the models created during training as voters.

Now, Tesseract uses a different kind of combination of multiple OCR models. It returns the character (or character sequence) which received the highest probability from one of the models. Usually you use, e.g., a Greek, a Latin, and a Fraktur model when running Tesseract. What one could try to do instead/in addition is the combination of Tesseract models from different stages of the training process (i.e. checkpoints).
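
For illustration only (the model names and the tessdata directory here are hypothetical), such a combination could reuse Tesseract's usual lang1+lang2 syntax at recognition time, with each finished checkpoint exported as its own traineddata file into that directory:

tesseract page.png page_out --tessdata-dir ./models \
    -l model_stage1+model_stage2+model_stage3

As described above, Tesseract would then return, per character sequence, the result that received the highest probability from one of the listed models.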

Shreeshrii commented 3 years ago

What one could try to do instead/in addition is the combination of Tesseract models from different stages of the training process (i.e. checkpoints).

Yes, if I remember correctly, @stweil had recommended using the last three checkpoint models to get better results.

Shreeshrii commented 3 years ago

The difference between test and eval is remarkable

The CER values are from lstmtraining and from the eval that it runs during training using the eval list.

Edit: I had accidentally used WER instead of CER. The corrected plot does not show that much of a difference.

Why is the number of data points for Evaluation CER so small?

I do not know the algorithm that Tesseract uses internally to decide when to run the eval. @stweil may be able to explain it.

I have not yet run lstmeval on the validation set. I will do that and add it to the plot.

Shreeshrii commented 3 years ago

@wrznr Thanks for pointing out the difference between the evaluation and training error rates. I had accidentally used the WER from the log file instead of CER, for the evaluation run. Have fixed it now.

I can submit the Python script as a PR, but it has hardcoded values pertaining to my data. It will need to be generalized for use with the Makefile.

Here are the plots from two different training runs for Sanskrit, which has support for Devanagari, English, and IAST (Latin with diacritics for Sanskrit transliteration).

[plot image: old run]

In the old run displayed above, I restarted training a few times, which may account for the sharp changes in the eval values. It also includes the plot from the validation run.

[plot image]

wrznr commented 3 years ago

@Shreeshrii Great stuff!

Since the evaluation error is still on par with or even below the training error, you can rule out overfitting and continue training for more iterations.

Shreeshrii commented 3 years ago

New improved version of plotting in PR https://github.com/tesseract-ocr/tesstrain/pull/218#issuecomment-747466387

meetyogi98 commented 3 years ago

The CER from the eval set used during training is not easily available.

I was able to extract it from the training log file.


[training log excerpt quoted from the comment above]

The eval iterations lag behind the training iterations.

Hi @Shreeshrii, I'm referring to Ray Smith's blog https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html. There it is mentioned to use the following command for training:

training/lstmtraining --debug_interval 100 \
  --continue_from ~/tesstutorial/eng_from_chi/eng.lstm \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --append_index 5 --net_spec '[Lfx256 O1c111]' \
  --model_output ~/tesstutorial/eng_from_chi/base \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 3000 &>~/tesstutorial/eng_from_chi/basetrain.log

I'm specifying the eval_listfile, but I'm not getting the eval char error rate during training like you show above. How can I get such log lines during training so that it is easier for me to evaluate the model?

wrznr commented 2 years ago

Meanwhile, it turned out that the error values estimated during the training process are not to be trusted: #261

bertsky commented 6 months ago

Meanwhile, it turned out that the error values estimated during the training process are not to be trusted: #261

This has been mitigated to some extent by https://github.com/tesseract-ocr/tesseract/pull/3644, which changed the descriptions given during training to reflect more honestly what is calculated.

Regarding the OP's question and @Shreeshrii's proposal for a plotting facility, #377 is the latest incarnation of this.