tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Q&A: Training Wiki Updates and Request for Info #659

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

@theraysmith

Ray, thanks for updating the Wiki page for LSTM training. A few more changes may be required in the following passages, in light of the updates:

In theory it isn't necessary to have a base Tesseract of the same language as the neural net Tesseract, but currently it won't load without something there.

Finally, combine your new model with the language model files into a traineddata file:

Please also provide a command for building a traineddata file with just the .lstm file, or with just the .lstm file and the lstm-* dawgs (so as to minimize the traineddata file size if only the LSTM engine is going to be used).
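
For reference, a hedged sketch of what such commands could look like with combine_tessdata (the san.* file names are placeholders, and the minimal component set is an assumption):

```sh
# Pack all san.* component files in the current directory into
# san.traineddata (the trailing dot selects the file-name prefix);
# keeping only san.lstm and the san.lstm-*-dawg files minimizes size.
combine_tessdata san.

# Replace just the LSTM model inside an existing traineddata file:
combine_tessdata -o san.traineddata san.lstm
```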

Shreeshrii commented 7 years ago

Info on the following will also be helpful:

  1. How big should the training text be (number of lines) for the different types of training?

  2. What kind of text is recommended, or can be used?

     e.g. for Sanskrit, I want to train by adding a layer, using a list of the most frequent orthographic syllables, so that the unicharset is expanded to include all possible aksharas. Will this work?

  3. Should training be done using different --ptsize values? If so, is it possible to modify tesstrain.sh to take a list of --ptsize options (similar to the exposures array, --exposures)? A hypothetical workaround is sketched below.
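
A hypothetical workaround for the last point, assuming tesstrain.sh keeps taking a single --ptsize value (language and paths are placeholders):

```sh
# Run tesstrain.sh once per point size and keep the outputs separate.
for pt in 10 12 14; do
  tesstrain.sh --lang san --langdata_dir ~/langdata \
    --tessdata_dir ./tessdata --ptsize "$pt" \
    --output_dir ~/tesstutorial/san_pt"$pt"
done
```
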
amitdo commented 7 years ago

My own question - the answer can also be added to the wiki.

Is it OK to mix b/w images produced by text2image with gray and/or color images from book scans?
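
For context, text2image renders binary (black-and-white) line images from UTF-8 text; a minimal sketch of an invocation (paths and font are placeholders):

```sh
# Render a training text as b/w .tif/.box pairs using a single font.
text2image --text ~/langdata/deu/deu.training_text \
  --outputbase deu.Arial.exp0 \
  --font 'Arial' --fonts_dir /usr/share/fonts
```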

Shreeshrii commented 7 years ago

Also, is there a way for Tesseract to create line boxes for a scanned image?

It would make it easier to put in the truth text if the box dimensions are pre-made.

amitdo commented 7 years ago

Also, is there a way for Tesseract to create line boxes for a scanned image? It would make it easier to put in the truth text if the box dimensions are pre-made.

This feature is not implemented. I will try to implement it sometime in the next few days and send a PR.

Shreeshrii commented 7 years ago

Another question:

What effect does the 'add a layer' type of training have on the unicharset in the new traineddata?

For 'add a layer', a unicharset is required, e.g. lstmtraining -U ~/tesstutorial/bih/bih.unicharset. Does this need to cover the base model's characters too?

Meaning: if we just want to add a few characters to the unicharset, is it enough to have a good sampling of those, or do the characters from the LSTM's unicharset (which are unknown at this point) need to be there too?
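
For context, the wiki's add-a-layer invocation looks roughly like the sketch below (paths are placeholders): --append_index cuts the network above the given layer, and --net_spec defines the replacement layers, with an output layer (O1c...) sized to the unicharset passed with -U, which is why this question matters.

```sh
# Hedged sketch of 'add a layer' training; O1c105 assumes a
# 105-entry unicharset and must match the -U file.
lstmtraining -U ~/tesstutorial/bih/bih.unicharset \
  --script_dir ~/langdata \
  --continue_from ~/tesstutorial/bih/eng.lstm \
  --model_output ~/tesstutorial/bih_output/bih \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --train_listfile ~/tesstutorial/bih/bih.training_files.txt
```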

Shreeshrii commented 7 years ago

The traineddata files in tessdata for 4.0 were trained with --perfect_sample_delay 19. The default value for this variable is 4.

The training command examples do not specify it. What are the recommended values for fine-tuning and for adding a layer?
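
For reference, the flag is simply added to the lstmtraining command line; a hedged fine-tuning sketch (paths are placeholders, and 19 is just the value quoted above, not a confirmed recommendation):

```sh
# Fine-tune from an existing model, delaying already-perfect samples
# more aggressively than the default of 4.
lstmtraining -U ~/tesstutorial/eng/eng.unicharset \
  --script_dir ~/langdata \
  --continue_from ~/tesstutorial/eng/eng.lstm \
  --model_output ~/tesstutorial/eng_fine/eng \
  --train_listfile ~/tesstutorial/eng/eng.training_files.txt \
  --perfect_sample_delay 19 \
  --max_iterations 400
```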

Shreeshrii commented 7 years ago

@theraysmith

Please see:

https://groups.google.com/forum/#!topic/tesseract-ocr/-N5uPdSvJGA
https://github.com/tesseract-ocr/tesseract/issues/642
https://github.com/tesseract-ocr/tesseract/issues/561

The 'core dumped' errors in these cases seem to be related to using --eval_listfile as part of the lstmtraining command, e.g. --eval_listfile ~/tesstutorial/saneval/san.training_files.txt.

Please update the wiki, if you can confirm this, so that people are able to run the tutorial.

Thanks.

Wikinaut commented 7 years ago

@amitdo Question to you; let me explain as briefly as I can:

I found certain groups of OCR failures in my scan case; two examples which were always wrongly detected: "Citroén" instead of the original word "Citroën", and "fiir" instead of "für".

Question

Is there an easy way (I guess it could be possible, and it would be very user-friendly) to "quickly" retrain the existing traineddata with a corrected output text?

amitdo commented 7 years ago

Hi @Wikinaut!

Believe it or not, I haven't yet started playing with training the LSTM engine, so I don't know enough to answer your question. Hopefully, this serious 'bug' will be fixed sometime in the next month :-)

Some observations: Both 'für' and 'fiir' are in the wordlist. https://raw.githubusercontent.com/tesseract-ocr/langdata/master/deu/deu.wordlist

'ë' does not appear in the training text, 'é' appears 4 times. https://github.com/tesseract-ocr/langdata/blob/master/deu/deu.training_text

Café So für René für Cafés André

'für' appears 10 times in the training text.

OCR Engine modes:
  0  Original Tesseract only.
  1  Neural nets LSTM only.
  2  Tesseract + LSTM.
  3  Default, based on what is available.

Did you try --oem 1?
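
For example (assuming the deu traineddata is installed):

```sh
# LSTM-only recognition:
tesseract scan.png out -l deu --oem 1
```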

Wikinaut commented 7 years ago

@amitdo my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

I also tried --oem 1, but found that --oem 2 gave the best results. However, I did not find an explanation of what this "mixed operation mode" really does. Please can we add a short text to "2 Tesseract + LSTM"? I can supply a PR, but do not know what a correct and short description would be.

Wikinaut commented 7 years ago

@amitdo and regarding my question above: can I "quickly" retrain my "deu" training data (or a copy of it) with a corrected text? That would be really great.

Promise: some mBitcoins for this today!

Wikinaut commented 7 years ago

Whoever coded the LSTM: Big APPLAUSE for him or her!

amitdo commented 7 years ago

LSTM: the new OCR engine, based on neural networks. Tesseract: the old OCR engine (started in the mid-80s), which does character segmentation and shape matching.

Wikinaut commented 7 years ago

@amitdo yes, but what if one selects --oem 2? Are the results of both engines then compared, or otherwise evaluated together?

amitdo commented 7 years ago

The two engines run, and the results are combined in some way.

Wikinaut commented 7 years ago

:+1:

amitdo commented 7 years ago

As said, I have zero experience training the LSTM engine.

What you want is described here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Shreeshrii commented 7 years ago

@Wikinaut

my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

Please provide a sample image for testing.

Wikinaut commented 7 years ago

@Shreeshrii

"für" vs. Tesseract: "fiir"

[cases 1-3: attached screenshots, 20170116-*_auswahl]

"Citroën" vs. Tesseract: "Citroén"

[cases 1-2: attached screenshots, 20170116-*_auswahl]

Shreeshrii commented 7 years ago

ë is not in the training_text. It needs to be added; I hope @theraysmith will include it in the next training.

"für" is being recognized; see the attached output files: e1ec5bba-dbc0-11e6-9e9a-8e65f50a9d60-oem1-png.txt, beb10966-dbc0-11e6-8b86-f89117a7918c-oem1-png.txt, 90011a70-dbc0-11e6-9889-d104dad6822a-oem1-png.txt.

Though 'ö' was not recognized in one image.

amitdo commented 7 years ago

ë is not in the training_text. It needs to be added; I hope @theraysmith will include it in the next training.

It's not in the German alphabet; it's from French (see https://en.wikipedia.org/wiki/German_language#Orthography). Still, maybe it should be included with the deu traineddata.

amitdo commented 7 years ago

It does look like 'ii' (two 'i's), doesn't it?

Maybe the training text needs some examples of 'ii' so it can learn to distinguish it from 'ü'.

Wikinaut commented 7 years ago

@Shreeshrii in my conversion, these words "für" were recognized as "fiir". This may be due to the use of "unpaper" as a preprocessor, and/or my use of "-l deu+eng --oem 2" for the conversion.

There were many more occurrences of the false detection "fiir" in my roughly 700 pages of text. This was the most frequent conversion error, and it triggered me to ask you how I could retrain tessdata using my corrected text file. A simple command line would be very helpful for such cases.

@amitdo regarding "ii": in my text, tesseract correctly OCRed "ii" in the words "Gummiisolation" and "Daiichi" (a name).

Wikinaut commented 7 years ago

@theraysmith You appear to be the expert to answer my question of whether such a procedure for re-training (Tesseract + LSTM) is easily possible or not:

(I described it already above:)

Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy of it) with a corrected text?

  • in.pdf -> tesseract -> out.txt
  • out.txt -> manually corrected -> corrected.txt
  • retraining tesseract (to get tesseract') with these inputs: in.pdf + corrected.txt

Re-running with the re-trained tesseract' should in the best case result in:

  • in.pdf -> tesseract' -> corrected.txt

I found, but do not (yet) understand, the present training explanations in the Wiki, and perhaps my idea is not yet covered.

theraysmith commented 7 years ago

This kind of retraining would be desirable, but is not available.

In your case, you don't need it though, as 4.00 works for all the examples of "für" that you provided. You just need to make sure you are using the latest code and data.

As Amit points out, e-diaeresis (ë) is not in the German alphabet. I successfully got "Citroën" correct by using fra+deu as the language. Unfortunately, it doesn't work with deu+fra, and neither works for the 2nd example. BTW this needed a bug fix for multi-language, which I will check in soon.

Wikinaut commented 7 years ago

@theraysmith Thank you for your swift answer.

In my case, many instances of "für" were detected as "fiir", with or without unpaper (I cannot remember exactly, because I tried many different runs).

I will retry - and report here - with only -l deu in order to present a correct case for reproduction.

Wikinaut commented 7 years ago

@theraysmith to be more precise:

I tried tesseract with --oem 0, 1 and 2, and found that "2" gave the best results (for a 700-page scan). I re-ran with and without unpaper and found some differences. And I only used -l deu+eng because my German text used some English terms. Now that I have a manually corrected reference output text, I can present (later) a kind of matrix with the results.

amitdo commented 7 years ago

New box renderer https://github.com/amitdo/tesseract/issues/3

stweil commented 7 years ago

Some of the problems with German texts were addressed in https://github.com/tesseract-ocr/langdata/pull/54, https://github.com/tesseract-ocr/langdata/pull/56 and https://github.com/tesseract-ocr/langdata/pull/57. I don't know whether those fixes are sufficient to improve future trainings.

Wikinaut commented 7 years ago

@stweil @amitdo Stefan, please can you also make sure that common words with a diaeresis (https://en.wikipedia.org/wiki/Diaeresis_(diacritic); German: Trema), like Citroën, are correctly trained?

stweil commented 7 years ago

I raised the more general question of whether all European languages should support all typical diacritical characters in the tesseract-dev forum, and I need information from @theraysmith to proceed.

stweil commented 7 years ago

I successfully got "Citroën" correct by using fra+deu as the language.

I expect that using additional languages has more side effects than recognizing additional characters, because they also add word lists, unigram frequencies, word bigrams and so on for those languages, which might have a negative effect on OCR results for texts which are mainly written in a single language but make only sparse use of additional languages. Examples of such texts are German texts with foreign person or trademark names, but also English scientific texts with additional Greek characters (a combination often used in mathematics and physics).

Wikinaut commented 7 years ago

@stweil Thanks for your swift answers. Let me know if I can help.

amitdo commented 7 years ago

Wikinaut, you can try the new best/Latin.traineddata

Shreeshrii commented 7 years ago

@Wikinaut

I found certain groups of OCR failures in my scan case; two examples which were always wrongly detected: "Citroén" instead of the original word "Citroën", and "fiir" instead of "für".

Does it work now with best traineddata?

Can I close this issue?

Wikinaut commented 7 years ago

I have not tried the latest version. Please leave this open; I will close it when it's solved.

amitdo commented 7 years ago

@Wikinaut,

The best/eng.traineddata doesn't have the marks you want.

Try the new best/Latin.traineddata.

stweil commented 7 years ago

The problem with "fiir" instead of "für" is a typical example of the ii/ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

Wikinaut commented 7 years ago

@stweil I now use the new https://github.com/tesseract-ocr/tessdata_best data, and found that a problem with lowercase vs. uppercase "s" exists in a 1000-page text.

Typical incorrectly detected word patterns are:

amitdo commented 7 years ago

The problem with "fiir" instead of "für" is a typical example of the ii/ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

Try to correct the mistakes in the wordlist and see if it helps to recognize these words.
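
A hedged sketch of that round trip, assuming the 4.0 component names (lstm-word-dawg, lstm-unicharset) and using the corrections named in this thread:

```sh
# Extract the packed word dawg and its unicharset.
combine_tessdata -e Latin.traineddata \
  Latin.lstm-word-dawg Latin.lstm-unicharset
# Convert the dawg to a plain-text wordlist.
dawg2wordlist Latin.lstm-unicharset Latin.lstm-word-dawg Latin.wordlist
# Fix the broken entries.
sed -i 's/^fiir$/für/; s/^dafiir$/dafür/' Latin.wordlist
# Rebuild the dawg and pack it back into the traineddata file.
wordlist2dawg Latin.wordlist Latin.lstm-word-dawg Latin.lstm-unicharset
combine_tessdata -o Latin.traineddata Latin.lstm-word-dawg
```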

stweil commented 7 years ago

... or run Tesseract without a wordlist. I recently removed the wordlists from the best traineddata to see and compare the real quality of the trained LSTM data. This is impossible when Tesseract uses a wordlist. With wordlists, Tesseract also invents words which don't occur in the original text ("computer" and "Google" in historical documents).

PS: Is there a parameter which disables the post-OCR steps (like wordlist evaluation) in Tesseract, without the need to remove the wordlists from the traineddata files?

amitdo commented 7 years ago

Yes, there is a parameter which disables the wordlist evaluation.

I don't remember its name right now...

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/960

I guess you can set the following two config variables to false so as not to load the wordlist dawg files:

load_system_dawg F
load_freq_dawg F
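
These can also be set on the command line with -c, e.g.:

```sh
# Run without loading the wordlist dawgs:
tesseract scan.png out -l deu --oem 1 \
  -c load_system_dawg=F -c load_freq_dawg=F
```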

amitdo commented 7 years ago

The parameter is lstm_use_matrix.

amitdo commented 7 years ago

I guess you can set the following two config variables to false so as not to load the wordlist dawg files: load_system_dawg F load_freq_dawg F

load_system_dawg should work.

load_freq_dawg seems to have no impact on the LSTM recognizer.

Shreeshrii commented 7 years ago

Those config variables relate to the legacy engine. New traineddata files have a different lstm-word-dawg and no freq-dawg files. So I am not sure whether they will work; I haven't tried it yet.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/blob/27d25e9c99ca65a2137f54f4c9c2bd20fc050024/dict/dict.cpp#L307

stweil commented 7 years ago

I wonder why LSTM needs its own word list. I'd expect that a word list is different for different languages, and it is also reasonable to use different word lists for different kinds of text (topic, date) of the same language, but it should not depend on the OCR algorithm.

Shreeshrii commented 7 years ago

It is not that the wordlist is different, but that the legacy engine and the LSTM models might be using different unicharsets.

The creation and unpacking of dawgs requires a unicharset; that's why there are two sets of dawg files (even for numbers and punctuation), in addition to the wordlist.