tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Fine Tuning Leads to Segmentation Issue #2132

Open jaddoughman opened 5 years ago

jaddoughman commented 5 years ago


Current Behavior:

I wanted to OCR a large dataset of Arabic newspapers with difficult delimiters and spacing. After running your original pre-trained model, I managed to recall about 80% of the required data. I opted to fine-tune your existing ara.traineddata file, using text lines as my training and test datasets. I used the "OCR-d Train" tool on GitHub to generate the necessary .box files.
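For reference, the fine-tuning flow described above typically goes through `combine_tessdata` and `lstmtraining`. This is a minimal sketch; the file names, list file, and iteration count are placeholders, not the poster's actual setup:

```shell
# Extract the LSTM component from the stock Arabic model.
combine_tessdata -e ara.traineddata ara.lstm

# Fine-tune on the .lstmf files named in the list file
# (ara.training_files.txt is a placeholder name).
lstmtraining \
  --model_output output/ara_finetuned \
  --continue_from ara.lstm \
  --traineddata ara.traineddata \
  --train_listfile ara.training_files.txt \
  --max_iterations 3600

# Package the best checkpoint back into a usable .traineddata file.
lstmtraining --stop_training \
  --continue_from output/ara_finetuned_checkpoint \
  --traineddata ara.traineddata \
  --model_output ara_finetuned.traineddata
```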

Throughout the fine-tuning process, the evaluation error percentages decreased substantially, which indicates that the model was trained successfully. I re-evaluated using my own method and confirmed the successful training.

However, the test dataset was made up of text lines, so both your evaluation and mine were generated at the text-line level. The issue occurred when I ran the fine-tuned model on a complete newspaper sample (composed of the same text-line fonts): the accuracy decreased significantly compared to your original pre-trained model. This made no sense at all. My fine-tuned model has better accuracy at the text-line level, but when run on a complete newspaper composed of the same fonts, your pre-trained model performs better than my successfully fine-tuned model.

The issue seems to be connected to your segmentation algorithm. This is a major problem, since it means your training tool only works at the text-line level and cannot be applied to any other form of dynamic text extraction. You will find below a sample newspaper, my fine-tuned model, and the learning curve from the training process.
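The line-level vs page-level discrepancy described above is consistent with how Tesseract's page segmentation modes work: a cropped line image can be recognized with layout analysis effectively bypassed, while a full page goes through automatic segmentation first. A sketch of the two invocations (image and model names are placeholders):

```shell
# Line crop: treat the image as a single text line, so the page
# segmenter is not exercised and only the LSTM recognizer matters.
tesseract line.png line_out -l ara_finetuned --psm 7

# Full newspaper page: automatic page segmentation (the default)
# runs first, and its errors hurt accuracy regardless of the model.
tesseract page.png page_out -l ara_finetuned --psm 3
```

This is why a model can win on line crops yet lose on full pages: the page-level result measures segmentation plus recognition, not recognition alone.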

Sample Newspaper: Sample Newspaper.zip

Fine Tuned Model: ara_finetuned.traineddata.zip

Learning Curve: Learning Curve (60k Iterations).pdf

jaddoughman commented 5 years ago

The font family can easily be identified, since the images come from a well-known newspaper that uses consistent font families throughout its archive. However, the bigger issue is the altered word detection after training. I fine-tuned on a specific font family, and the results were worse than the pre-trained model's. My question is: how can fine-tuning a model decrease its accuracy? And how can fine-tuning alter the detection of the words themselves?

The OCR process, as you know, has four main steps: 1) binarization, 2) segmentation, 3) classification, 4) post-processing.

Word detection occurs prior to the classification of the letters themselves. The layout analysis attached above shows altered and incorrect word detection for the fine-tuned model. The questions above should be addressed in an updated version of Tesseract. OCRing the 185K archive is part of a research paper; the months invested in training Tesseract shouldn't go to waste. I have plenty of samples if you wish to experiment.
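One way to make the word-detection claim above concrete is to dump word-level bounding boxes from both models and diff them. A sketch, assuming the page image and model names used earlier in the thread:

```shell
# TSV output includes one row per detected word; columns 7-10
# (left, top, width, height) give each word's bounding box.
tesseract page.png base_out  -l ara           tsv
tesseract page.png tuned_out -l ara_finetuned tsv

# Diffing the two TSV files shows where word detection diverges
# between the stock model and the fine-tuned one.
diff base_out.tsv tuned_out.tsv
```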

@Shreeshrii @amitdo @stweil

Shreeshrii commented 5 years ago

Google/Ray have not shared the training text used for LSTM training for Arabic, so we only have the 80 lines from the langdata repo. Fine-tuning works best, AFAIK, when the original training text is used with minimal changes. Training on a different text leads to worse results, as you have pointed out.

zdenop commented 5 years ago

@jaddoughman: As far as I understand, the Cognitive Services Arabic OCR API is part of Microsoft Computer Vision, which is an alternative to Cloud Vision, not to Tesseract. These kinds of services are neither free nor open source.

amitdo commented 4 years ago

> The issue is still caused by word detection, since a fine-tuned model would never perform worse than the original one.

Your assumption is wrong.

https://github.com/tesseract-ocr/tesseract/issues/2132#issuecomment-450518730

As Shree pointed out, you should not train on too many lines with the same font. It will lead to overfitting.