user-words list seems only work when full-matched to the target word

nai-kon commented 4 years ago

I'm using Tesseract with the legacy dictionary (not the LSTM dictionary) for Japanese recognition. I added custom words to the user-words dictionary, but they seem to take effect only when an entry fully matches the target word.

Japanese has no spacing between words, similar to compound words in German, e.g. "トマトサラダ" (tomatosarada => tomato sarada).

When I ran an image of "トマトサラダ" (tomatosarada) without a user dictionary, I got the incorrect result "トマトサうダ" (tomatosauda). So I added "サラダ" (sarada) to the user dictionary, but the result didn't change. Next, I added "トマト" (tomato) and "サラダ" (sarada) as separate entries, but the result still didn't change. Finally, I added the full word "トマトサラダ" (tomatosarada), and the result became correct.

I thought that dictionary correction was done by partial match, so I expected it to be enough to add each component of a compound noun to the dictionary (e.g. adding "tomato" and "sarada" individually).

But with the legacy dictionary, it seems to be done by full match. What I am afraid of is having to add countless combinations of compound nouns to the dictionary.

I also tried the LSTM dictionary. Without a user dictionary, "トマトサラダ" was recognized as "トマトサラ人ダ". After I added "サラダ" (sarada) to the user dictionary, the result became correct. With the LSTM dictionary, correction seems to be done by partial match.
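For anyone trying to reproduce the two runs described above, here is a minimal sketch of how the user-words file and the two command lines could be set up. The file names (`tomatosarada.png`, `user_words.txt`), the language code `jpn`, and the choice of `--psm 7` (single text line) are assumptions for illustration and are not taken from the report; `--oem 0` selects the legacy engine and `--oem 1` the LSTM engine.

```python
from pathlib import Path


def write_user_words(words, path):
    """Write a user-words file: one word per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def build_command(image, out_base, user_words, oem, psm=7, lang="jpn"):
    """Build a tesseract command line as an argument list.

    oem=0 is the legacy engine, oem=1 the LSTM engine.
    psm=7 (single text line) is an assumption about the test image.
    """
    return [
        "tesseract", str(image), str(out_base),
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
        "--user-words", str(user_words),
    ]


if __name__ == "__main__":
    # Hypothetical file names; adjust to the actual image and word list.
    write_user_words(["トマト", "サラダ"], "user_words.txt")
    for oem in (0, 1):
        cmd = build_command("tomatosarada.png", f"out_oem{oem}", "user_words.txt", oem)
        print(" ".join(cmd))
```

Running each printed command and comparing the two output files would show whether the engines treat the same word list differently. Note that `--oem 0` only works if the installed traineddata file contains a legacy model.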

Now, I have two questions.

Q1. Is correction in the Tesseract 3.X legacy engine done by full match? If so, should I add every possible combination of compound nouns to the dictionary?
Q2. Is correction in the Tesseract 4.X LSTM engine done by partial match?

I searched around and couldn't find a solution. Please help me out. Thank you.

Environment

Tesseract Version: tesseract 4.1.0-rc1-95-g3baf leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Platform: Ubuntu 16.04.5 LTS

bertsky commented 4 years ago

@nai-kon Could you please specify in which PSM (page segmentation mode, see tesseract --help) you have been running? Also, a copy of the image you are describing would be helpful.

> But with the legacy dictionary, it seems to be done by full match.

It shouldn't. So yes, what you are describing looks like a bug in the legacy engine. (And both engines should behave the same in that respect.)

> What I am afraid of is having to add countless combinations of compound nouns to the dictionary.

You don't have to. Even for languages like German with whitespace-delimited words but undelimited compounds, you don't have to do that. (The dictionary will be applied at word boundaries with or without spaces.)
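To make the distinction between the two behaviours concrete, here is a toy sketch. It is not Tesseract's implementation; the function names and the dynamic-programming approach are assumptions chosen only to illustrate full-match lookup versus matching a compound as a concatenation of dictionary words.

```python
def full_match(text, words):
    """True only if the whole string is itself a dictionary entry."""
    return text in words


def can_segment(text, words):
    """True if the string can be split into a sequence of dictionary entries.

    reachable[i] records whether text[:i] is a concatenation of entries.
    """
    reachable = [True] + [False] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            if reachable[start] and text[start:end] in words:
                reachable[end] = True
                break
    return reachable[-1]


if __name__ == "__main__":
    user_words = {"トマト", "サラダ"}
    for candidate in ("トマトサラダ", "トマトサうダ"):
        print(candidate, full_match(candidate, user_words), can_segment(candidate, user_words))
```

With only "トマト" and "サラダ" in the list, the full-match check rejects "トマトサラダ", while the segmentation check accepts it and still rejects the misrecognized "トマトサうダ". The reported legacy behaviour resembles the first function, the expected behaviour the second.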

astrung commented 4 years ago

Hi. I have some problems when using the user-words config. I posted a question at the link below. Could anyone check it and answer? https://stackoverflow.com/questions/59307205/tesseract-5-0-bazaar-user-words-config-doesnt-work

astrung commented 4 years ago

Does anyone have any idea? Please help.

tesseract-ocr / tesseract

user-words list seems only work when full-matched to the target word #2756
