tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Whitelist for Non-English Characters #3070

Open YunTsen opened 4 years ago

YunTsen commented 4 years ago

Environment

Current Behavior:

While using chi_tra on this image, the result was "載", which was great. However, after specifying the whitelist with either of the following settings: config='--oem 0 --psm 6 -c tessedit_char_whitelist=\u8f09' (\u8f09 is the Unicode escape for "載") or config='--oem 0 --psm 6 -c tessedit_char_whitelist=載', the result was empty (null).

It seems that the whitelist only accepts English characters or digits (it does work for numbers; I have tested that). Why is that?

P.S. I tried this because I wanted Tesseract to detect only the characters on the whitelist.

Expected Behavior:

Since chi_tra detects "載" accurately without the whitelist, it should also work when tessedit_char_whitelist=載 is given.

Suggested Fix:

The variable tessedit_char_whitelist should accept non-English characters.

Moldoteck commented 4 years ago

Actually, I have tried it with Russian characters and it worked pretty well. So I assume the problem is specific to some subset of UTF-8 characters.

ChunkyZhang commented 3 years ago

I have run into the same problem. Has it been solved?

focusexplorer commented 3 years ago

Setting the system language (locale) to UTF-8 seems to work.
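A minimal sketch of the locale suggestion above, assuming Tesseract is launched as a subprocess from Python and that the `C.UTF-8` locale exists on the system (it may not on every platform). `utf8_env` is a hypothetical helper and `input.png` a placeholder file name; whether this resolves the whitelist problem is not confirmed in this thread.

```python
import locale
import os


def utf8_env(base=None):
    """Copy an environment mapping and force a UTF-8 locale.

    Assumption: the C.UTF-8 locale is available on this system.
    """
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C.UTF-8"
    env["LANG"] = "C.UTF-8"
    return env


# Show which encoding the current process would use for text by default.
print(locale.getpreferredencoding(False))

# Hypothetical invocation (requires tesseract and the chi_tra data installed):
# import subprocess
# subprocess.run(
#     ["tesseract", "input.png", "stdout", "-l", "chi_tra",
#      "--oem", "0", "--psm", "6", "-c", "tessedit_char_whitelist=載"],
#     env=utf8_env(),
# )
```

Passing an explicit environment keeps the change local to the Tesseract call instead of altering the system-wide language setting.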

amitdo commented 2 years ago

tessedit_char_whitelist=\u8f09

AFAIK, this usage is not supported.

Did you try:

tessedit_char_whitelist=載

?
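To illustrate the difference between the two forms, here is a small sketch, assuming the config string is built in Python 3 source code. In a normal string literal Python decodes `\u8f09` itself, so both forms yield the same string. In a raw string (or when typed directly in a shell), the six ASCII characters `\u8f09` are passed through unchanged. `whitelist_chars` is a hypothetical helper used only for this illustration.

```python
# Normal Python 3 string literals decode \uXXXX escapes at parse time.
escaped = "tessedit_char_whitelist=\u8f09"
literal = "tessedit_char_whitelist=載"

# A raw string keeps the backslash sequence as six separate ASCII characters,
# which is also what a shell would pass along unchanged.
raw = r"tessedit_char_whitelist=\u8f09"


def whitelist_chars(option: str) -> str:
    """Return the value after '=' (hypothetical helper for illustration)."""
    return option.split("=", 1)[1]


print(escaped == literal)           # True
print(len(whitelist_chars(raw)))    # 6
```

So when the escape is written in a regular Python 3 string, the whitelist is already the literal character; only the raw or shell form is the backslash sequence.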

amitdo commented 2 years ago

Anyway, the allowlist / denylist feature is known to not work well with the LSTM engine.
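One possible fallback when the allowlist is not honored is to run OCR without it and filter the recognized text afterwards. This is only a sketch of that idea, not something Tesseract does itself. `keep_allowed` is a hypothetical helper, and unlike a real allowlist this does not steer recognition: it just discards unwanted characters from the output.

```python
def keep_allowed(text: str, allowed: str) -> str:
    """Drop every character not in `allowed`, preserving whitespace."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set or ch.isspace())


# Example with a made-up OCR result string.
print(keep_allowed("載a1 載\n", "載"))
```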