tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
62.09k stars, 9.5k forks

user-words list seems only work when full-matched to the target word #2756

Open nai-kon opened 4 years ago

nai-kon commented 4 years ago

I'm using Tesseract with the legacy dictionary (no LSTM dictionary) for Japanese recognition. I added custom words to the user-words dictionary, but they seem to take effect only when they fully match a word.

Japanese has no spacing between words, similar to German compound nouns, e.g. "トマトサラダ" (tomatosarada => tomato sarada).

When I ran the image of "トマトサラダ" (tomatosarada) without a dictionary, I got the incorrect result "トマトサうダ" (tomatosauda). So I added "サラダ" (sarada) to the user dictionary, but the result didn't change. Next, I added "トマト" (tomato) and "サラダ" (sarada) individually to the user dictionary, but the result still didn't change. Finally, I added the full word "トマトサラダ" (tomatosarada), and the result became correct.
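For reference, the three attempts above correspond to three different user-words files. The following is a minimal sketch, assuming the usual plain-text format of one UTF-8 word per line; the `write_user_words` helper and the file names are hypothetical and only illustrate the setup.

```python
from pathlib import Path
import tempfile


def write_user_words(path, words):
    """Write one word per line, UTF-8 encoded (assumed user-words format)."""
    path = Path(path)
    path.write_text("".join(word + "\n" for word in words), encoding="utf-8")
    return path


# The three attempts described above (file names are hypothetical).
ATTEMPTS = {
    "suffix_only.txt": ["サラダ"],
    "both_parts.txt": ["トマト", "サラダ"],
    "full_compound.txt": ["トマトサラダ"],
}

if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    for name, words in ATTEMPTS.items():
        print(write_user_words(out_dir / name, words))
```

Each file can then be passed to Tesseract in a separate run to compare the results.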

I thought dictionary correction was done by partial match, so I expected it to be enough to add each component of a compound noun to the dictionary (e.g. adding "tomato" and "sarada" individually).

But with the legacy dictionary, it seems to be done by full match. What I'm afraid of is having to add countless combinations of compound nouns to the dictionary.
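The difference between the two behaviours can be sketched as follows. This is only an illustration of full match versus partial (segment-wise) match against a word list; it is not Tesseract's actual dictionary code, and the `segment` and `full_match` functions are hypothetical.

```python
def segment(text, lexicon):
    """Return one split of `text` into lexicon words, or None if none exists."""
    # best[i] holds a segmentation of text[:i], or None if text[:i] cannot be split.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in lexicon:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


def full_match(text, lexicon):
    """Accept `text` only if it appears in the lexicon as a whole."""
    return text in lexicon


lexicon = {"トマト", "サラダ"}
print(full_match("トマトサラダ", lexicon))  # False
print(segment("トマトサラダ", lexicon))     # ['トマト', 'サラダ']
print(segment("トマトサうダ", lexicon))     # None
```

With partial matching, two dictionary entries cover the compound; with full matching, every compound would need its own entry.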

I also tried the LSTM dictionary. Without a user dictionary, "トマトサラダ" became "ト マ ト サ ラ 人 ダ". After I added "サラダ" (sarada) to the user dictionary, the result became correct. With the LSTM dictionary, dictionary correction seems to be done by partial match.


Now, I have two questions.

I searched around and couldn't find a solution. Please help me out. Thank you.

Environment

bertsky commented 4 years ago

@nai-kon Could you please specify in which PSM (page segmentation mode, see tesseract --help) you have been running? Also, a copy of the image you are describing would be helpful.
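To make such a report reproducible, the exact invocation can be spelled out per PSM. The following is a rough sketch that only builds the command lines, assuming the standard `-l`, `--oem`, `--psm` and `--user-words` options of the `tesseract` CLI; the image and word-list file names and the `build_tesseract_command` helper are hypothetical.

```python
import shlex


def build_tesseract_command(image, user_words, psm, oem=0, lang="jpn"):
    """Build (but do not run) a tesseract CLI invocation as an argument list."""
    return [
        "tesseract", str(image), "stdout",
        "-l", lang,
        "--oem", str(oem),  # 0 = legacy engine, 1 = LSTM engine
        "--psm", str(psm),  # page segmentation mode
        "--user-words", str(user_words),
    ]


if __name__ == "__main__":
    # 6 = uniform block of text, 7 = single text line, 8 = single word
    for psm in (6, 7, 8):
        cmd = build_tesseract_command("tomato_salad.png", "user_words.txt", psm)
        print(shlex.join(cmd))
```

Running the printed commands with `--oem 0` and again with `--oem 1` would show whether the two engines differ for the same PSM and word list.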

> But with the legacy dictionary, it seems to be done by full match.

It shouldn't. So yes, what you are describing looks like a bug in the legacy engine. (And both engines should behave the same in that respect.)

> What I'm afraid of is having to add countless combinations of compound nouns to the dictionary.

You don't have to. Even for languages like German with whitespace-delimited words but undelimited compounds, you don't have to do that. (The dictionary will be applied at word boundaries with or without spaces.)

astrung commented 4 years ago

Hi. I'm having some problems using the user-words config. I posted a question at the link below. Could anyone check it and answer it for me? https://stackoverflow.com/questions/59307205/tesseract-5-0-bazaar-user-words-config-doesnt-work

astrung commented 4 years ago

Does anyone have any idea? Please help.