tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Recognition issues with the "@" and "â" characters in the Tesseract 4 .traineddata for some languages which should recognize them #1544

Open MaxPower85 opened 6 years ago

MaxPower85 commented 6 years ago

I've tried the .traineddata files from https://github.com/tesseract-ocr/tessdata and from https://github.com/tesseract-ocr/tessdata_fast for several languages, including Serbian Latin, Croatian and Turkish, and they seem to have major issues recognizing the "@" and "â" characters.

The Tesseract 4 .traineddata for Turkish seems to always confuse "@" with "©" (the copyright sign), while the .traineddata for Serbian Latin and Croatian seem to confuse it with various characters at random (like "G" and "O").

Serbian, Croatian and Turkish also use "â" in various words (it isn't counted as a separate letter of the alphabet, but it is still used). You can see that it is mentioned here: https://en.wikipedia.org/wiki/%C3%82#Serbo-Croatian

So "â" and "Â" (for words written entirely in capitals) should be included for Serbian, Croatian and Turkish, but "â" just seems to be recognized as an ordinary "a".

In Serbian and Croatian, without the letter "â" it may not be clear what someone meant unless the context is given, because some words with very different meanings look the same when written with an ordinary "a". The "â" can also be the only way to tell the singular form of a word from its plural when there is no context (often, words that end with an "a" have a plural form that replaces the "a" with an "â").

amitdo commented 6 years ago

Letters/symbols that are uncommon in the training samples will be recognized sub-optimally.

Shreeshrii commented 6 years ago
  1. Also try recognition with 'tessdata_fast'

  2. Run 'combine_tessdata -u' on the traineddata files and review the extracted lstm-unicharset to check whether the needed letters are included. I find it easier to sort the file to check coverage.

  3. Try fine-tune/plus-minus training using commonly used fonts for the language. Add 15-20 samples of the letters that are currently not recognised well to a representative training text. See the wiki page on training 4.0.0 for details.
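
The coverage check in step 2 can be sketched in Python once the unicharset has been unpacked (for example with something like `combine_tessdata -u tur.traineddata tur.`; the file names here are only an illustration). This is a minimal sketch, assuming that the first line of the unicharset holds the entry count and that every following line starts with the symbol itself. The helper names `unicharset_symbols` and `missing_symbols` are hypothetical, and the sample text is made up rather than taken from a real traineddata file.

```python
import unicodedata


def unicharset_symbols(text):
    """Collect the first field of every entry line in unicharset contents."""
    symbols = set()
    # Assumption: line 1 is the entry count; each later line starts with the symbol.
    for line in text.splitlines()[1:]:
        fields = line.split()
        if fields:
            symbols.add(unicodedata.normalize("NFC", fields[0]))
    return symbols


def missing_symbols(text, required):
    """Return the required characters that have no entry, in the given order."""
    present = unicharset_symbols(text)
    return [ch for ch in required
            if unicodedata.normalize("NFC", ch) not in present]


# Hypothetical, heavily shortened sample; real entries carry more property fields.
sample = "4\nNULL 0 NULL 0\na 3 0\n@ 10 0\n© 10 0\n"
print(missing_symbols(sample, ["@", "â", "Â"]))
```

For this made-up sample the check reports "â" and "Â" as missing. With a real file, read its contents with `open(path, encoding="utf-8").read()` and pass them in the same way.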

If no new letters are added, i.e. if the unicharset stays the same, you can do fine-tune training; a maximum of 300-400 iterations gives good results.
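
Before training, it may help to confirm that the training text really contains enough samples of the problem characters. The following is a rough sketch, not part of the Tesseract tooling: the helper name `underrepresented` is hypothetical, and the default threshold of 15 is only an assumption taken from the lower end of the "15-20 samples" suggested above.

```python
import unicodedata
from collections import Counter


def underrepresented(training_text, required, minimum=15):
    """Map each required character seen fewer than `minimum` times to its count."""
    # Normalize to NFC so "â" typed as "a" + combining circumflex is counted too.
    counts = Counter(unicodedata.normalize("NFC", training_text))
    result = {}
    for ch in required:
        key = unicodedata.normalize("NFC", ch)
        if counts[key] < minimum:
            result[key] = counts[key]
    return result


# Made-up text: plenty of "â", only three "@", and no "Â" at all.
text = "â" * 20 + "@" * 3 + "abc"
print(underrepresented(text, ["@", "â", "Â"]))
```

Any character reported here would need more sample lines added to the training text before fine-tuning.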

Shreeshrii commented 6 years ago

Typo above: I meant tessdata_best.