Shreeshrii opened this issue 7 years ago
Also helpful will be info on:
E.g., for Sanskrit, I want to train by adding a layer, using a list of the most frequent orthographic syllables, so that the unicharset is expanded to include all possible aksharas. Will this work?
My own question - the answer can also be added to the wiki.
Is it OK to mix b/w images produced by text2image with gray and/or color images from a book scan?
Also, is there a way for tesseract to create line boxes for a scanned image? It will make it easier to put in the truth text if the box dimensions are pre-made.
This feature is not implemented. I will try to implement it sometime in the next few days and send a PR.
Another question:
What effect does the "add a layer" type of training have on the unicharset in the new traineddata?
For "add a layer", a unicharset is required, e.g. lstmtraining -U ~/tesstutorial/bih/bih.unicharset
Does this mean that, if we just want to add a few characters to the unicharset, it is enough to have a good sampling of those, or do characters from the lstm unicharset (which are unknown at this point) need to be there too?
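For reference, an "add a layer" invocation in the style of the TrainingTesseract-4.00 wiki looks roughly like the sketch below. The paths, the append index, and the net_spec are illustrative placeholders for this Sanskrit example, not canonical values:

```shell
# Sketch of an "add a layer" run, following the TrainingTesseract-4.00 wiki.
# All paths and the net_spec are illustrative placeholders.
# --append_index cuts the existing network at the given layer and appends
# the new layers, so the output layer is rebuilt against the expanded unicharset.
lstmtraining \
  --continue_from ~/tesstutorial/san/san.lstm \
  --model_output ~/tesstutorial/sanlayer/layer \
  --append_index 5 --net_spec '[Lfx256 O1c111]' \
  -U ~/tesstutorial/san/san.unicharset \
  --train_listfile ~/tesstutorial/san/san.training_files.txt \
  --debug_interval 100 --target_error_rate 0.01
```

Here san.unicharset would be the expanded unicharset that includes the new aksharas.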
Traineddata files in tessdata for 4.0 were trained with --perfect_sample_delay 19. The default value for the variable is 4.
The training command examples do not specify this. What are the recommended values for fine-tuning and for adding a layer?
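If you want to experiment, --perfect_sample_delay is just another lstmtraining flag; a hedged sketch (paths are placeholders, and whether 19 or the default 4 is better for fine-tuning is exactly the open question above):

```shell
# Illustrative only: add --perfect_sample_delay to any lstmtraining invocation.
# 19 matches what the shipped 4.0 traineddata was reportedly trained with;
# the built-in default is 4.
lstmtraining \
  --continue_from ~/tesstutorial/eng/eng.lstm \
  --model_output ~/tesstutorial/eng_ft/ft \
  --train_listfile ~/tesstutorial/eng/eng.training_files.txt \
  --perfect_sample_delay 19
```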
@theraysmith
Please see https://groups.google.com/forum/#!topic/tesseract-ocr/-N5uPdSvJGA https://github.com/tesseract-ocr/tesseract/issues/642 https://github.com/tesseract-ocr/tesseract/issues/561
The 'core dumped' error in these cases seems to be related to using --eval_listfile as part of the lstmtraining command, e.g.:
--eval_listfile ~/tesstutorial/saneval/san.training_files.txt
Please update the wiki, if you can confirm this, so that people are able to run the tutorial.
Thanks.
@amitdo Question to you, let me explain as briefly as I can:
tesseract in.ppm out -l deu --oem 2 txt
I found certain groups of OCR failures in my scan case; two examples which were always wrongly detected: "Citroén" instead of the original word "Citroën", and "fiir" instead of "für".
Is there an easy way - I guess it could be possible, and it would be very user-friendly - to retrain the engine with a corrected text?
Hi @Wikinaut!
Believe it or not, I haven't started yet playing with training the LSTM engine, so I don't know enough to answer your question. Hopefully, this serious 'bug' will be fixed sometime in the next month :-)
Some observations: Both 'für' and 'fiir' are in the wordlist. https://raw.githubusercontent.com/tesseract-ocr/langdata/master/deu/deu.wordlist
'ë' does not appear in the training text, 'é' appears 4 times. https://github.com/tesseract-ocr/langdata/blob/master/deu/deu.training_text
Café So für René für Cafés André
'für' appears 10 times in the training text.
OCR Engine modes:
0 - Original Tesseract only.
1 - Neural nets LSTM only.
2 - Tesseract + LSTM.
3 - Default, based on what is available.
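The modes map directly to the --oem flag, so the easiest way to see the difference is to run the same scan through each engine (file names here are placeholders):

```shell
# Compare engine modes on the same scan (in.ppm is a placeholder).
tesseract in.ppm out-legacy -l deu --oem 0 txt   # legacy engine only
tesseract in.ppm out-lstm   -l deu --oem 1 txt   # LSTM engine only
tesseract in.ppm out-both   -l deu --oem 2 txt   # both engines combined
diff out-lstm.txt out-both.txt                   # eyeball the differences
```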
Did you try --oem 1?
@amitdo my original text uses a very "bad" font, where the characters overlap very often, and the characters often look, but are not, "ligatures". This explains the "fiir" in many cases (in my case).
I also tried --oem 1, but found that --oem 2 gave the best results. However, I did not find an explanation of what this "mixed operation mode" is really doing. Please can we add a short text to "2 Tesseract + LSTM"? I can supply a PR, but do not know what a correct and short description is.
@amitdo and regarding my question above: can I "quickly" retrain my "deu" training data (or a copy of it) with a corrected text? That would be really great.
Promise: some mBitcoins for this today!
Whoever coded the LSTM: Big APPLAUSE for him or her!
LSTM - new OCR engine based on neural networks. Tesseract - old OCR engine (started in the mid 80s) that does character segmentation and shape matching.
@amitdo yes, but what if one selects --oem 2? Are the results of both engines then compared, or otherwise evaluated together?
The two engines run and the results are combined in some way.
:+1:
As said, I have zero experience training the LSTM engine.
What you want is described here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
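The fine-tuning recipe on that wiki page boils down to extracting the shipped LSTM model and continuing training on your own line data. A hedged sketch (paths are placeholders; your lstmf files would be made from your corrected text):

```shell
# Hedged sketch of fine-tuning per the TrainingTesseract-4.00 wiki.
# 1. Extract the LSTM model from the shipped traineddata.
combine_tessdata -e tessdata/deu.traineddata ~/tesstutorial/deu/deu.lstm
# 2. Continue training from it on your own training files.
lstmtraining \
  --continue_from ~/tesstutorial/deu/deu.lstm \
  --model_output ~/tesstutorial/deu_ft/ft \
  --train_listfile ~/tesstutorial/deu_ft/deu.training_files.txt \
  --target_error_rate 0.01
```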
@Wikinaut
my original text uses a very "bad" font, where the characters overlap very often, and the characters often look, but are not, "ligatures". This explains the "fiir" in many cases (in my case).
Please provide a sample image for testing.
@Shreeshrii
[attached sample images: case 1, case 2, case 3, Case 1, Case 2]
ë is not in the training_text. Needs to be added, hope that @theraysmith will include in next training.
für - is being recognized -see attached output files. e1ec5bba-dbc0-11e6-9e9a-8e65f50a9d60-oem1-png.txt beb10966-dbc0-11e6-8b86-f89117a7918c-oem1-png.txt 90011a70-dbc0-11e6-9889-d104dad6822a-oem1-png.txt
though ö was not recognized in one image.
ë is not in the training_text. Needs to be added, hope that @theraysmith will include in next training.
https://en.wikipedia.org/wiki/German_language#Orthography It's not in the German alphabet; it's from French. Still, maybe it should be included with the deu traineddata.
It does look like 'ii' (two 'i's), doesn't it?
Maybe the training text needs some examples of 'ii' so it can learn to distinguish it from 'ü'.
@Shreeshrii in my conversion, these words "für" were recognized as "fiir". This may be due to the use of "unpaper" as a preprocessor, and/or my use of "-l deu+eng --oem 2" for the conversion.
There were many more occurrences of falsely detected "fiir" in my roughly 700 pages of text. This was the most frequent conversion error, and it triggered me to ask you how I could retrain tessdata using my corrected text file. A simple command line would be very helpful for such cases.
@amitdo regarding "ii": In my text, tesseract correctly ocr-ed "ii" in the words "Gummiisolation", and "Daiichi" (a name).
@theraysmith You appear to be the expert for answering my question, if such a procedure for re-training (tesseract + LSTM) is easily possible, or not:
(I described it already above:)
Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy of it) with a corrected text?
- in.pdf -> tesseract -> out.txt
- out.txt -> manually corrected -> corrected.txt
- retraining tesseract (to get tesseract') with these inputs: in.pdf + corrected.txt
Re-running the re-trained tesseract' should in the best case result in:
- in.pdf -> tesseract' -> corrected.txt
I found, but do not (yet) understand, the present training explanations in the Wiki; perhaps my idea is not yet covered.
This kind of retraining would be desirable, but is not available.
In your case, you don't need it though, as 4.00 works for all the examples of "für" that you provided. You just need to make sure you are using the latest code and data.
As Amit points out, e-diaeresis (ë) is not in the German alphabet. I successfully got "Citroën" correctly by using fra+deu as the language. Unfortunately, it doesn't work with deu+fra, and neither works for the 2nd example. BTW, this needed a bug fix for multi-language, which I will check in soon.
-- Ray.
@theraysmith Thank you for your swift answer.
In my case, many "für" were detected as "fiir", whether or not unpaper was used (I cannot remember exactly, because I tried many different runs). I will retry with only -l deu and report here, in order to present a correct case for reproduction.
@theraysmith to be more precise:
I tried tesseract with --oem 0, 1, and 2, and found that "2" gave the best results (for a 700-page scan). I reran with and without unpaper and found some differences. And I only used -l deu+eng because my German text used some English terms. Now that I have a manually corrected reference output text, I can present (later) a kind of matrix with the results.
New box renderer https://github.com/amitdo/tesseract/issues/3
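For what it's worth, later Tesseract 4.x builds grew built-in box-rendering configs; if your build has them, something like the following produces box files for a scanned page (config availability varies between versions, so treat this as a sketch):

```shell
# Produce box files for a scanned page (image.png is a placeholder).
tesseract image.png image lstmbox     # per-symbol boxes in LSTM-training style
tesseract image.png image wordstrbox  # one box per line with a WordStr text record
tesseract image.png image makebox     # classic per-character boxes
```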
Some of the problems with German texts were addressed in https://github.com/tesseract-ocr/langdata/pull/54, https://github.com/tesseract-ocr/langdata/pull/56 and https://github.com/tesseract-ocr/langdata/pull/57. I don't know whether those fixes are sufficient to improve future trainings.
@stweil @amitdo Stefan, please can you also make sure that common words with a https://en.wikipedia.org/wiki/Diaeresis_(diacritic) (German: Trema), like Citroën, are correctly trained?
I addressed the more general question whether all European languages should support all typical diacritical characters in the tesseract-dev forum and need information from @theraysmith to proceed.
I successfully correctly got "Citroën" by using fra+deu as the language.
I expect that using additional languages has more side effects than recognizing additional characters, because they also add word lists, unigram frequencies, word bigrams and so on for those languages, which might have a negative effect on OCR results for texts which are mainly written in a single language but make sparse use of additional languages. Examples of such texts are German texts with foreign person or trade mark names, but also English scientific texts with additional Greek characters (a combination often used in mathematics and physics).
@stweil Thanks for your swift answers. Let me know, if I can help.
Wikinaut, you can try the new best/Latin.traineddata
@Wikinaut
I found certain groups of OCR failures in my scan case, two examples which were always wrongly detected: "Citroén" instead of the original word "Citroën", and "fiir" instead of "für".
Does it work now with best traineddata?
Can I close this issue?
I have not tried the latest version. Please leave this open - I will close it if it's solved.
@Wikinaut,
The best/eng.traineddata doesn't have the marks you want.
Try the new best/Latin.traineddata.
The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.
@stweil I now use the new https://github.com/tesseract-ocr/tessdata_best data, and found that a problem with lowercase vs. uppercase "s" exists in a 1000-page text; typical incorrectly detected word patterns are:
The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür"), "fiir" (correct: "für") for example.
Try to correct the mistakes in the wordlist and see if it helps to recognize these words.
... or run Tesseract without a wordlist. I recently removed the wordlists from the best traineddata to see and compare the real quality of the trained LSTM data. This is impossible when Tesseract uses a wordlist. With wordlists, Tesseract also invents words which don't occur in the original text ("computer" and "Google" in historical documents).
PS. Is there a parameter which disables the post-OCR steps (like wordlist evaluation) in Tesseract, without the need to remove the wordlists from the traineddata files?
Yes, there is a parameter which disables the wordlist evaluation.
I don't remember its name right now...
Please see https://github.com/tesseract-ocr/tesseract/issues/960
I guess you can set the following two config variables to false so that the wordlist dawg files are not loaded:
load_system_dawg F
load_freq_dawg F
The parameter is lstm_use_matrix.
I guess, you can make the following two config variables as false to not load the wordlist dawg files. load_system_dawg T load_freq_dawg T
load_system_dawg should work.
load_freq_dawg seems to have no impact on the lstm recognizer.
Those config variables are related to the legacy engine. New traineddata files have a different lstm-word-dawg and no freq-dawg files, so I am not sure whether they will work. I haven't tried it yet.
I wonder why LSTM needs its own word list. I'd expect that a word list is different for different languages, and it is also reasonable to use different word lists for different kinds of text (topic, date) of the same language, but it should not depend on the OCR algorithm.
It is not that the wordlist is different, but the fact that the legacy engine and LSTM models might be using different unicharsets.
The creation and unpacking of dawgs requires unicharsets, that's why there are two sets of dawg files, even for numbers and punctuation, in addition to the wordlist.
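That pairing is visible in the dawg tools themselves: both conversions take the matching unicharset as an argument (file names below are placeholders):

```shell
# Dawg <-> wordlist conversions always need the matching unicharset.
wordlist2dawg deu.wordlist deu.lstm-word-dawg deu.lstm-unicharset
dawg2wordlist deu.lstm-unicharset deu.lstm-word-dawg deu.wordlist.out
```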
@theraysmith
Ray, thanks for updating the Wiki page for LSTM training. A few more changes may be required in the following, in light of the updates:
Please also provide a command for building traineddata with just the .lstm file, or with just the .lstm and lstm-dawgs (so as to minimize traineddata file size if only LSTM is going to be used).
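Until the wiki documents it, a hedged sketch of what such a packing step might look like with combine_tessdata, which packs every file named prefix.component it finds under the given prefix:

```shell
# combine_tessdata packs all files named <prefix>.<component> into <prefix>.traineddata.
# Keep only the LSTM components in the directory to get a minimal, LSTM-only file.
# File names below are placeholders for components previously extracted with
# combine_tessdata -e.
mkdir -p minimal && cd minimal
cp ../deu.lstm ../deu.lstm-word-dawg ../deu.lstm-punc-dawg \
   ../deu.lstm-number-dawg ../deu.lstm-unicharset ../deu.lstm-recoder .
combine_tessdata deu.
```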