Wordlist cleaning (lots of incomplete words found in Thai wordlist)

bact commented 6 years ago

I spotted a number of entries in langdata/tha/tha.wordlist that are certainly invalid Thai words, since they violate word formation rules (for example, they contain a vowel that requires an immediately following consonant, but that consonant is missing).

For example [line number][instance]: 165 ส์ 207 ต์ 335 ย์ 404 ท์ 428 ด์ 527 น์ 580 ห์ 629 อั 658 ล์ 774 ค์ 787 นั 798 ชั่ 863 เอ็ 886 สั 986 มั 1114 ว์ 1187 ฮั 1244 ชั 1305 กั 1310 ษ์ 1380 ลั 1487 บั 1554 ดั ... 7649 ลั่ 7656 ยั 7666 ฉั 7733 เกี๋ 7914 ล่ 7931 น๊ 8008 ส่ 8045 ญั 8100 ข์ 8148 ด่ ...

Should we remove all these instances? They seem to follow some patterns as well, such as:

char + u0e31 : c ั
char + u0e31 + tonemarks : c ั่
char + u0e4c : c ์
u0e40 + char + u0e47 : เc็

Is each entry in this wordlist meant to be a word in itself, or is it supposed to be a component of a larger word? In the latter case, it's totally OK to leave them as they are. But in the former case, we should remove them, as they are not words.

I'm not entirely sure how tesseract utilizes XXX.wordlist in langdata, so please correct me if this is irrelevant. Thank you.

Shreeshrii commented 6 years ago

This is a file from 3.04. I would suggest that you unpack the current traineddata, extract the wordlist from it and see if the list is the same.

If it is, try removing the error words from that list, combine the traineddata again and test for accuracy.

The commands should be similar to the following; please change the paths to match your setup.

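# Unpack the current traineddata (the output directory must exist first)
mkdir -p ./tessdata_TEST
combine_tessdata -u ./tessdata_best/tha.traineddata ./tessdata_TEST/tha.
# Convert the LSTM word DAWG back into a plain-text wordlist
dawg2wordlist ./tessdata_TEST/tha.lstm-unicharset ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-word-list

# REVIEW & EDIT the wordlist (./tessdata_TEST/tha.lstm-word-list)

# Rebuild the DAWG from the edited wordlist and recombine the traineddata
wordlist2dawg ./tessdata_TEST/tha.lstm-word-list ./tessdata_TEST/tha.lstm-word-dawg ./tessdata_TEST/tha.lstm-unicharset
combine_tessdata ./tessdata_TEST/tha.

# COMPARE accuracy of ./tessdata_best/tha.traineddata and ./tessdata_TEST/tha.traineddata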

bact commented 6 years ago

Thank you for the detailed instructions. I will try that.

bact commented 6 years ago

I found those error words in the current tha.traineddata (from https://github.com/tesseract-ocr/tessdata_best) as well.

Current ./tessdata_best/tha.lstm-word-list : 9083 lines
Modified ./tessdata_TEST/tha.lstm-word-list : 8811 lines (272 error words removed)

I compared the two tessdata using a screenshot of a short text from https://prachatai.com/journal/2018/02/75448 (I chose two paragraphs containing Thai text only), with the options "--oem 1 -l tha" (LSTM, Thai).

Not much difference in accuracy, as both performed badly :( The modified tessdata is slightly (very slightly) better, though.

Example original text: กิติภูมิ กล่าวว่า มาร์กบอกว่าการต่อสู้ทางชนชั้น

Output text from current tessdata: ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั ้ น

Output text from modified tessdata: ก ิ ต ิ ภู ม ิ ก ล ่ า ว ว ่ า ม า ร ์ ก บ อ ก ว ่ า ก า ร ต ่ อ ส ู ้ ท า ง ชน ชั้น

Characters were recognized perfectly with both tessdata. But as you can see, most of the time the characters are separated by spaces, which they shouldn't be.

The only difference between the outputs from the current tessdata and the modified tessdata here is that the last word, "ชั้น", comes out of the modified tessdata as a proper combined word, with no spaces in between.
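To put a number on this kind of difference, one option is a character error rate: the edit distance between the original text and the OCR output, divided by the length of the original. The sketch below is only an illustration and not how the comparison in this thread was made; `edit_distance` and `char_error_rate` are made-up helper names, and the strings are just the tail of the example above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # drop a character of a
                cur[j - 1] + 1,            # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Tail of the example above: original text and the two OCR outputs.
reference = "ชนชั้น"
current_out = "ชน ชั ้ น"
modified_out = "ชน ชั้น"

print("current :", char_error_rate(reference, current_out))
print("modified:", char_error_rate(reference, modified_out))
```

Every spurious space counts as one insertion, so the output with fewer stray spaces gets the lower error rate.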

In general, removing character combinations that are impossible in Thai from the word list makes the output a little more accurate. But maybe I need to adjust some config.

Current tha.config:

segsearch_max_futile_classifications 10
language_model_ngram_on 1
language_model_ngram_space_delimited_language F
chop_enable 0

These are the patterns of the words that were removed:

^.[่้๊๋็ํั์]$
^.[ัื][่้๊๋]$
^เ.[็ิีื][่้๊๋]?$
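A minimal sketch of how these three patterns could be applied to the extracted wordlist, in Python. The function names are illustrative assumptions, not part of any Tesseract tool, and the sample entries are taken from the examples earlier in this thread rather than from the full wordlist.

```python
import re

# The three patterns quoted above. Each describes a fragment that
# cannot stand alone as a Thai word.
INCOMPLETE_PATTERNS = [
    re.compile(r"^.[่้๊๋็ํั์]$"),
    re.compile(r"^.[ัื][่้๊๋]$"),
    re.compile(r"^เ.[็ิีื][่้๊๋]?$"),
]


def is_incomplete(word):
    """Return True if the entry matches any of the fragment patterns."""
    return any(p.match(word) for p in INCOMPLETE_PATTERNS)


def clean_wordlist(words):
    """Split entries into (kept, removed), skipping blank lines."""
    kept, removed = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue
        (removed if is_incomplete(word) else kept).append(word)
    return kept, removed


# Illustrative sample: four fragments reported above plus three real words.
sample = ["ส์", "ชั่", "เอ็", "เกี๋", "ชั้น", "การ", "ชน"]
kept, removed = clean_wordlist(sample)
print("kept:", kept)
print("removed:", removed)
```

For the real file, the same function could be fed the lines of ./tessdata_TEST/tha.lstm-word-list (opened with encoding="utf-8") and the kept entries written back before running wordlist2dawg.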

Shreeshrii commented 6 years ago

The extra spaces could be related to an issue reported earlier (for a different language); see https://github.com/tesseract-ocr/tesseract/issues/1009

You may want to try running OCR with

-c preserve_interword_spaces=1

to remove the extra spaces.
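For reference, a small sketch of the full invocation, combining this setting with the "--oem 1 -l tha" options used in the test above. It only builds the argument list; `build_tesseract_command` is a hypothetical helper, and `page.png` and `page` are placeholder names for the input image and the output base.

```python
import shlex


def build_tesseract_command(image_path, output_base, lang="tha", oem=1,
                            preserve_spaces=True):
    """Build the tesseract argument list for an LSTM run."""
    cmd = ["tesseract", image_path, output_base, "--oem", str(oem), "-l", lang]
    if preserve_spaces:
        # Config variable suggested above to suppress the extra spaces.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


cmd = build_tesseract_command("page.png", "page")
print(shlex.join(cmd))
```

With tesseract installed, the list can be passed to subprocess.run(cmd, check=True); the recognized text is then written to the output base with a .txt extension.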

bact commented 6 years ago

Thank you! The extra spaces are solved with -c preserve_interword_spaces=1

Tested with several different parts of the text from the same web page, the current tessdata and the modified tessdata produced exactly the same output.

No improvement in terms of accuracy can be measured from the test.

Shreeshrii commented 6 years ago

So it looks like the wordlist is not used much in recognition.

Shreeshrii commented 6 years ago

@jbreiden

preserve_interword_spaces=1 should be added to the config files in tessdata_fast for CJK languages and Thai.

Shreeshrii commented 6 years ago

Extra space problem identified in the comment above - https://github.com/tesseract-ocr/langdata/issues/106#issuecomment-365960730

Characters were recognized perfectly with both tessdata. But as you can see, most of the time the characters are separated by spaces, which they shouldn't be.

Fixed via tesseract-ocr/tessdata_fast#7

@zdenop Please close this issue after the PR is merged in tessdata_fast.
