Open nai-kon opened 4 years ago
@nai-kon Could you please specify in which PSM (page segmentation mode, see tesseract --help) you have been running? Also, a copy of the image you are describing would be helpful.
But with the legacy dictionary, it seems to be done by full match.
It shouldn't. So yes, what you are describing looks like a bug in the legacy engine. (And both engines should behave the same in that respect.)
What I am afraid of is that I would have to add countless combinations of compound nouns to the dictionary.
You don't have to. Even for languages like German with whitespace-delimited words but undelimited compounds, you don't have to do that. (The dictionary will be applied at word boundaries with or without spaces.)
Hi. I have some problems when using the user-words config. I posted a question at the link below. Can anyone check and answer it for me? https://stackoverflow.com/questions/59307205/tesseract-5-0-bazaar-user-words-config-doesnt-work
Does anyone have any idea? Please help.
I'm using Tesseract with the legacy dictionary (no LSTM dictionary) for Japanese recognition. I added custom words to the user-words dictionary, but it seems to work only when a word matches in full.
Japanese has no spacing between words, similar to German compounds, e.g. "トマトサラダ" (tomatosarada => tomato sarada).
When I ran the image of "トマトサラダ" (tomatosarada) without a dictionary, I got the incorrect result "トマトサうダ" (tomatosauda). So I added "サラダ" (sarada) to the user dictionary, but the result didn't change. Next, I added "トマト" (tomato) and "サラダ" (sarada) individually to the user dictionary, but the result still didn't change. Finally, I added the full word "トマトサラダ" (tomatosarada), and the result became correct.
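For anyone trying to reproduce the steps above, here is a minimal sketch of how such a run could be set up. The file names and helper functions are hypothetical and not taken from the report. It only writes a plain-text word list (one word per line) and builds the argument list; `--oem 0` selects the legacy engine, `--oem 1` the LSTM engine, and `--psm` sets the page segmentation mode. Whether the word list is actually honored depends on the engine and version, which is the question of this thread.

```python
import tempfile
from pathlib import Path


def write_user_words(path, words):
    """Write one word per line as a UTF-8 plain-text user-words file."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def build_tesseract_command(image, out_base, user_words, oem=0, psm=7, lang="jpn"):
    """Return the argument list for a single Tesseract run.

    The command is only built here, not executed.
    """
    return [
        "tesseract", str(image), str(out_base),
        "-l", lang,
        "--oem", str(oem),      # 0 = legacy engine, 1 = LSTM engine
        "--psm", str(psm),      # 7 = treat the image as a single text line
        "--user-words", str(user_words),
    ]


# Hypothetical file names, used only for illustration.
workdir = Path(tempfile.mkdtemp())
words_file = workdir / "user_words.txt"
write_user_words(words_file, ["トマト", "サラダ"])

cmd = build_tesseract_command("tomato_salad.png", "result", words_file, oem=0, psm=7)
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run` would perform the actual recognition, assuming Tesseract and the Japanese language data are installed. Stating the `--oem` and `--psm` values explicitly makes it easier to compare legacy and LSTM results on the same image.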
I thought that dictionary correction was done by partial match, so I expected that adding each component of a compound noun to the dictionary would be enough (e.g. adding "tomato" and "sarada" individually).
But with the legacy dictionary, it seems to be done by full match. What I am afraid of is that I would have to add countless combinations of compound nouns to the dictionary.
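To make the difference between the two behaviors concrete, here is a toy sketch. It is not Tesseract's actual dictionary code, only an illustration of the two matching strategies being contrasted: `full_match` accepts a string only if the whole string is a dictionary entry, while `can_segment` accepts it if it can be split entirely into dictionary entries. Both function names are made up for this example.

```python
def full_match(text, dictionary):
    """Accept only if the whole string is itself a dictionary entry."""
    return text in dictionary


def can_segment(text, dictionary):
    """Accept if the string can be split entirely into dictionary entries."""
    # reachable[i] is True when text[:i] can be covered by dictionary words.
    reachable = [True] + [False] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            if reachable[start] and text[start:end] in dictionary:
                reachable[end] = True
                break
    return reachable[len(text)]


parts = {"トマト", "サラダ"}

print(full_match("トマトサラダ", parts))   # False: the compound itself is not an entry
print(can_segment("トマトサラダ", parts))  # True: "トマト" + "サラダ"
print(can_segment("トマトサうダ", parts))  # False: "サうダ" is not an entry
```

Under the full-match strategy every compound needs its own entry, which is the combinatorial problem described above. Under the segmentation strategy the individual components are enough.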
Also, I've tried with the LSTM dictionary. "トマトサラダ" without a dictionary became "ト マ ト サ ラ 人 ダ". So I added "サラダ" (sarada) to the user dictionary, and the result became correct. With the LSTM dictionary, the dictionary correction seems to be done by partial match.
Now, I have two questions.
Q1. Is correction in the Tesseract 3.X legacy engine done by full match? If so, should I add every possible combination of compound nouns to the dictionary?
Q2. Is correction in the Tesseract 4.X LSTM engine done by partial match?
I searched around and couldn't find a solution. Please help me out. Thank you.
Environment