tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

LSTM: Training - converging and accuracy problems #734

Closed. Shreeshrii closed this issue 7 years ago.

Shreeshrii commented 7 years ago

@theraysmith

Please see detailed report at https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/LUHy-niac6U/7oZgqIWLCwAJ

Copied message:

I have been trying to train Tesseract 4.0 with my own data in order to extract text that is a mix of natural language words and domain-specific (non-natural language) words (acronyms, identifiers, abbreviations). The standard Tesseract model has trouble recognizing the domain-specific words: words visible in the source are either dropped entirely or recognized with parts missing. So, I decided to train my own model.

I went through the tutorials and set up a number of experiments, but so far with no real success. While I could fix the problem of entirely dropped words by lowering the hard-coded confidence threshold, and achieved partial success in recognizing domain-specific words, the accuracy on natural language words went down.

Two observations I have made so far from the following experiments:

In Experiment 1 I used the available data as-is for training (~1M tokens, ~150 fonts). I then generated an evaluation data set of another ~200k tokens in the ~15 most relevant fonts. I trained the model by replacing the top layer of the existing Tesseract traineddata, as described at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer The training converged a couple of days later and I evaluated the model on a held-out dataset with a gold standard (tiff / plain txt). The accuracy I got was lower than with the standard Tesseract model. I noticed that the model is able to recognize some (not all) domain-specific words, but the performance on natural language words went down (where the standard model worked fine). So, I analyzed the errors and designed another experiment to address them; in my opinion they were caused by data skewness, i.e. confusions between characters in rare and complex contexts.

In Experiment 2 I used the entire data set I have (~120M tokens) and extracted word and character-bigram statistics. I took all words with frequencies over a certain threshold as part of the final training data set. In addition, I boosted the counts of words containing low-frequency character bigrams (which caused me trouble in the previous experiment) and appended them to the final training data set. In the end this resulted in a training data set of ~600k unique words. This was then rendered into TIFFs with ~150 fonts; the evaluation data set remained a natural language text of ~200k tokens in the ~15 most relevant fonts. It turned out that training converges too slowly: it has been running for over a week now, with the best model at a ~0.17% error rate. Evaluating pairs of subsequent model snapshots on the held-out dataset showed no general improvement of one over the other, only random fluctuations between better accuracy on natural language words vs. domain-specific words and vice versa. More interestingly, models with a lower char error rate (< 0.5%) perform worse (especially on natural language words) than models with a higher char error rate (~0.5%). I also noticed that the model captures "language modeling features", which makes the recognition of misspelled words, "non-natural language" unique identifiers, and acronyms difficult. Moreover, unique identifiers, rare words, etc. in text remain a big problem; they can be recognized in chunks, but not as whole words. More specifically, trouble cases are words "like-this", "like/this" or "this-or-like-this".
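
A minimal sketch of the frequency-based selection described above, in Python, with hypothetical file names and thresholds (the actual corpus, cut-offs, and boosting scheme are not given in the original post):

```python
from collections import Counter

# Hypothetical corpus file and thresholds; not the poster's actual values.
CORPUS = "corpus.txt"
MIN_WORD_FREQ = 50        # keep words that occur at least this often
RARE_BIGRAM_FREQ = 1000   # a character bigram is "rare" below this count

word_freq = Counter()
bigram_freq = Counter()

# Count word and character-bigram frequencies over the corpus.
with open(CORPUS, encoding="utf-8") as f:
    for line in f:
        for word in line.split():
            word_freq[word] += 1
            for a, b in zip(word, word[1:]):
                bigram_freq[a + b] += 1

def has_rare_bigram(word):
    """True if the word contains at least one low-frequency character bigram."""
    return any(bigram_freq[a + b] < RARE_BIGRAM_FREQ for a, b in zip(word, word[1:]))

# Frequent words form the core training vocabulary; words containing rare
# bigrams are added (boosted) so those contexts are seen during training.
frequent = {w for w, c in word_freq.items() if c >= MIN_WORD_FREQ}
boosted = {w for w in word_freq if has_rare_bigram(w)}

with open("training_words.txt", "w", encoding="utf-8") as out:
    for w in sorted(frequent | boosted):
        out.write(w + "\n")
```

The resulting word list would then be rendered into training TIFFs (e.g. with text2image) as described above.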

At this point I am doubting whether the way I am training Tesseract is correct, so I would like to ask the community the following questions:

- Should I use natural language text or a dictionary of words for the training and evaluation data sets?
- How important is the effect of token redundancy? (Are the errors in recognizing natural language words caused by those words occurring only once in the training data?)
- How do I get Tesseract to recognize freely generated tokens that are not in the training dataset?

Thanks, Alex

Shreeshrii commented 7 years ago

I have had similar results, though with smaller training sets, both with Devanagari and san_latn (Sanskrit in Latin Script).

kolomiyets commented 7 years ago

Update: After preparing new training and evaluation data, I trained a new model once again. The data now comprises 4.6M tokens of about 600k words (with a controlled distribution, "shape-similar" to the real word distribution) rendered in 150 fonts. I used ~80% of the data for training and 20% as held-out eval data. After about 6 weeks (:)) of training, the model converged to 0.01. Then I took a single page of the training TIFF file and ran Tesseract on that page.
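
For context, an 80/20 split like this is usually realized by dividing the generated .lstmf files into two list files that are passed to lstmtraining via --train_listfile and --eval_listfile. A minimal sketch in Python, assuming a hypothetical directory layout:

```python
import glob
import random

# Hypothetical path; adjust to wherever the .lstmf files were generated.
lstmf_files = sorted(glob.glob("deu-ground-truth/*.lstmf"))

random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(lstmf_files)

split = int(0.8 * len(lstmf_files))   # 80% training, 20% held-out eval
with open("list.train", "w") as f:
    f.writelines(p + "\n" for p in lstmf_files[:split])
with open("list.eval", "w") as f:
    f.writelines(p + "\n" for p in lstmf_files[split:])
```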

Despite a high quality level for "alpha-numerical" words, ALL words which contain "-" are wrong. I thought the word I was looking at might be one of those responsible for the 0.01 char error rate, so I dug deeper into the log file. I see that the word, as well as the entire line, was perfectly recognized during training. Moreover, this word occurs at least 5 times in the training text, and thus at least ~750 times across all fonts.

So, I am a bit confused that words which were recognized perfectly during training are wrong when evaluated separately. And this is the case for all words containing "-".

Below I provide an example (line image attached in the original issue):

TSV for the line above (columns: level page_num block_num par_num line_num word_num left top width height conf text):

5 1 1 1 2 1 113 188 467 50 51 Fernsehtechnologien
5 1 1 1 2 2 597 191 407 48 51 Tontinengeschäfte
5 1 1 1 2 3 1022 193 350 39 52 Konzernsteuern
5 1 1 1 2 4 1391 194 343 49 51 Kartografierung
5 1 1 1 2 5 1748 196 447 49 0 Avantgarde-ösung, <------- L is dropped
5 1 1 1 2 6 2216 191 467 56 52 Ölversorgungsrouten
5 1 1 1 2 7 2700 201 240 38 40 verschwört
5 1 1 1 2 8 2958 201 429 49 54 Dialoginstrumenten
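
A quick way to spot such cases automatically is to parse Tesseract's TSV output (tesseract line.tif out tsv) and flag word-level rows with very low confidence or words missing from the ground-truth line. A minimal sketch, assuming hypothetical file names and an arbitrary confidence cut-off:

```python
import csv

# Hypothetical inputs; Tesseract's TSV output starts with a header row:
# level page_num block_num par_num line_num word_num left top width height conf text
TSV_FILE = "out.tsv"
GROUND_TRUTH = set(
    "Fernsehtechnologien Tontinengeschäfte Konzernsteuern Kartografierung "
    "Avantgarde-Lösung, Ölversorgungsrouten verschwört Dialoginstrumenten".split()
)
MIN_CONF = 30  # arbitrary threshold for flagging low-confidence words

with open(TSV_FILE, encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["level"] != "5":        # level 5 = word-level entries
            continue
        word, conf = row["text"], float(row["conf"])
        if conf < MIN_CONF or word not in GROUND_TRUTH:
            print(f"suspicious: {word!r} (conf={conf})")
```

On the line above, this would flag "Avantgarde-ösung," both for its confidence of 0 and because it does not match the ground truth.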

While training, this line has been used:

Iteration 3826338: ALIGNED TRUTH : Fernsehtechnologien Tontinengeschäfte Konzernsteuern Kartografierung Avantgarde-Lösung, Ölversorgungsrouten verschwört Dialoginstrumenten
Iteration 3826338: BEST OCR TEXT : Fernsehtechnologien Tontinengeschäfte Konzernsteuern Kartografierung Avantgarde-Lösung, Ölversorgungsrouten verschwört Dialoginstrumenten
File /tmp/tmp.F8zm7RXvJr/deu/deu.Microsoft_Sans_Serif.exp0.lstmf page 20 (Perfect):

Can someone help me to find out what I have been doing wrong?

Thanks, Alex

Shreeshrii commented 7 years ago

The LSTM training process has since been changed. Hence, closing this issue.

nissansz commented 2 years ago

Did you train based on sentences? steve8000818@gmail.com