tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Some characters missing in spa.training_text make Tesseract fail to recognize them #137

Open diegodlh opened 5 years ago

diegodlh commented 5 years ago

When running unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É", and "«" are absent from the training text (while their counterparts "ñ", "é", and "»" are present). As a result, Tesseract fails to recognize these characters with --oem 0 (for example, it recognizes "Ñ" as "NN" and "É" as "EI"). I'm a beginner in Tesseract training and I'm not sure how these training_text files are generated; they seem to be a more or less random set of words and short phrases. It occurred to me that I could simply make some replacements to cover the missing characters: España -> ESPAÑA, años -> AÑOS, también -> TAMBIÉN, México -> MÉXICO, and I also replaced half of the occurrences of "»" with "«". If my assumption that this file is mostly random is correct, please consider pulling this commit into master. Thank you!
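For anyone who wants to confirm which characters are covered before (or after) making such replacements, here is a minimal sketch; the file path spa.training_text and the character list are assumptions taken from this issue, not part of any official tooling:

```python
# check_coverage.py — count occurrences of selected characters in a training text.
# Minimal sketch based on this issue; path and character list are assumptions.
from collections import Counter

REQUIRED = ["Ñ", "É", "«", "ñ", "é", "»"]

with open("spa.training_text", encoding="utf-8") as f:
    counts = Counter(f.read())

for ch in REQUIRED:
    n = counts.get(ch, 0)
    status = "OK" if n > 0 else "MISSING"
    print(f"{ch!r}: {n} occurrence(s) [{status}]")
```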

Shreeshrii commented 5 years ago

Thank you. This training text file is suitable for Tesseract 3.0x (base Tesseract). For 4.0 and LSTM training, please see the langdata_lstm repo.

diegodlh commented 5 years ago

Indeed, I retried tesstrain.sh with langdata_lstm, and its training_text file is long enough that this time unicharset_extractor did not complain about missing characters. Still, since users may still be using langdata to train their Tesseract 3.0x engine (or Tesseract 4.0 with --oem 0, as I understand it), I think it would be useful to merge my commit into plain langdata's master branch. Thanks!
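To double-check the outcome of a training run, one can also inspect the unicharset that unicharset_extractor produced. The sketch below assumes the usual output layout, where the first line holds the entry count and each following line begins with the UTF-8 character itself; the spa.unicharset path is only an example:

```python
# check_unicharset.py — verify that specific characters made it into a unicharset file.
# Sketch under the assumption that line 1 is the entry count and each later line
# starts with the unichar (the usual unicharset_extractor output layout).
REQUIRED = {"Ñ", "É", "«"}

with open("spa.unicharset", encoding="utf-8") as f:
    lines = f.read().splitlines()

# Skip the count on the first line; take the leading token of each entry line.
# Special entries such as "NULL" or the space character are simply ignored here.
present = {line.split(" ", 1)[0] for line in lines[1:] if line}

for ch in sorted(REQUIRED):
    print(f"{ch}: {'present' if ch in present else 'MISSING'}")
```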