Whitelist for Non-English Characters

YunTsen commented 4 years ago

Environment

Tesseract Version: tesseract v5.0.0-alpha.20200328
Platform: Windows, 64-bit

Current Behavior:

When I used chi_tra on this image, the result was "載", which was great. However, after specifying the whitelist with either config='--oem 0 --psm 6 -c tessedit_char_whitelist=\u8f09' (\u8f09 is the Unicode escape for "載") or config='--oem 0 --psm 6 -c tessedit_char_whitelist=載', the result turned out to be null.

It seems that the whitelist only accepts English characters or digits (it does work for numbers; I have tested that). Why is that?

P.S. I tried this because I wanted Tesseract to detect only the words on the whitelist.

Expected Behavior:

Since chi_tra detects "載" accurately without the whitelist, it should also work when whitelist="載" is given.

Suggested Fix:

The variable tessedit_char_whitelist should accept non-English characters.

Moldoteck commented 4 years ago

Actually, I have tried this with Russian characters and it worked pretty well. So I am assuming that the problem is specific to some subset of UTF-8.

ChunkyZhang commented 3 years ago

I'm running into the same problem. Has it been solved?

focusexplorer commented 3 years ago

Setting the system language to UTF-8 seems to work.
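A minimal sketch for checking whether the current locale could be the culprit, assuming the whitelist is passed from Python on the command line. The thread does not confirm the mechanism; the idea here is only that a non-UTF-8 system encoding may be unable to represent "載" at all. The helper names `can_encode` and `report` are hypothetical.

```python
import locale
import sys


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def report(text: str = "載") -> dict:
    """Summarize the encodings Python sees on this machine (illustrative only)."""
    preferred = locale.getpreferredencoding(False)
    return {
        "preferred_encoding": preferred,
        "filesystem_encoding": sys.getfilesystemencoding(),
        "encodable_in_preferred": can_encode(text, preferred),
    }


if __name__ == "__main__":
    print(report())
```

If `encodable_in_preferred` comes back `False`, the whitelist character may be getting mangled before Tesseract ever sees it, which would be consistent with the UTF-8 setting helping.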

amitdo commented 2 years ago

tessedit_char_whitelist=\u8f09

AFAIK, this usage is not supported.

Did you try:

tessedit_char_whitelist=載

?
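A small sketch of the difference between the two forms, assuming the config string is built in Python 3 (as the `config='...'` syntax in the report suggests). In a normal string literal Python resolves `\u8f09` to "載" before anything is passed on, so both forms end up identical; only a raw string (or an escape typed directly in a shell) delivers the six literal characters `\u8f09`. The helper `build_config` is hypothetical.

```python
def build_config(allowed: str, oem: int = 0, psm: int = 6) -> str:
    """Build a config string restricting output to the characters in `allowed`."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={allowed}"


literal = build_config("載")
escaped = build_config("\u8f09")  # Python decodes the escape into the character
raw = build_config(r"\u8f09")     # raw string: backslash, 'u', '8', 'f', '0', '9'

print(literal == escaped)
print(literal == raw)

# Hypothetical usage with pytesseract (requires Tesseract and chi_tra installed):
#   import pytesseract
#   text = pytesseract.image_to_string(image, lang="chi_tra", config=literal)
```

So from Python, the two configs in the original report should reach Tesseract as the same string, which suggests the empty result has another cause.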

amitdo commented 2 years ago

Anyway, the allowlist / denylist feature is known to not work well with the LSTM engine.
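One possible workaround, not proposed in this thread and only a sketch, is to run OCR without an allowlist and filter the text afterwards. This is not equivalent to a real allowlist (it cannot steer recognition toward allowed characters, it only drops the others), and the helper `keep_allowed` is hypothetical.

```python
def keep_allowed(text: str, allowed: str) -> str:
    """Drop every character not in `allowed`, keeping whitespace for layout."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set or ch.isspace())


if __name__ == "__main__":
    # Stand-in for OCR output; in practice this would come from Tesseract.
    print(keep_allowed("載abc 載", "載"))
```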
