kcobra / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Vietnamese dictionaries #1392

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Unpack vie.traineddata downloaded from Tesseract repository
2. Run dawg2wordlist on vie.freq-dawg & vie.word-dawg to recover original lists
3. Examine the content
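
The steps above can be sketched as a small Python helper that builds the two tool invocations. This is only an illustration, assuming the Tesseract training tools `combine_tessdata` and `dawg2wordlist` are installed and on the PATH; the function name and the `.recovered.txt` output names are hypothetical and not part of the original report.

```python
import shlex


def recovery_commands(traineddata="vie.traineddata", prefix="vie."):
    """Return the argv lists that unpack a traineddata file and dump its DAWGs."""
    # Step 1: unpack every component of the traineddata file under `prefix`.
    commands = [["combine_tessdata", "-u", traineddata, prefix]]
    # Step 2: convert each DAWG back to a plain word list.
    # dawg2wordlist takes: <unicharset> <dawg file> <output word list>.
    for name in ("freq", "word"):
        commands.append([
            "dawg2wordlist",
            f"{prefix}unicharset",
            f"{prefix}{name}-dawg",
            f"{prefix}{name}.recovered.txt",  # hypothetical output name
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them by hand.
    for argv in recovery_commands():
        print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the printed commands look right; the recovered text files are then what step 3 examines.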

What is the expected output? What do you see instead?

The recovered word lists are found to be incomplete and contain many erroneous 
entries.

Please use the included dictionaries as the training data for the Vietnamese language.

Original issue reported on code.google.com by nguyen...@gmail.com on 12 Dec 2014 at 3:04

Attachments:

GoogleCodeExporter commented 9 years ago
Can you have a look and review the Vietnamese dictionaries in the langdata repository?

https://code.google.com/p/tesseract-ocr/source/browse/vie/?repo=langdata&name=master

Original comment by zde...@gmail.com on 16 Apr 2015 at 1:52

GoogleCodeExporter commented 9 years ago
vie.wordlist.clean would need to be scrapped entirely: it contains many 
misspelled Vietnamese and English words, words missing diacritical marks, and 
words run together (Vietnamese words are mostly monosyllables).

The provided vie.words_list is composed of several lists commonly used among 
Vietnamese-language application developers, including those from 
http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html.

The fourth column in vie.unicharambigs contains many characters that are not 
Vietnamese, e.g., üûñËÄ. Those characters should not be used as match 
targets.
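
A check along these lines could flag such entries. It is a minimal sketch that assumes the file is tab-separated with the match target in the fourth column, and that ASCII characters are acceptable; the function names are hypothetical. The Vietnamese letter set is built from the twelve vowels, the five tone marks, and đ.

```python
import unicodedata

# Vowels that can carry a tone mark, and the five combining tone marks
# (grave, acute, tilde, hook above, dot below).
VOWELS = "aăâeêioôơuưy"
TONES = ["\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]


def vietnamese_letters():
    """Build the set of non-ASCII and ASCII Vietnamese letters, both cases."""
    letters = set("đ") | set(VOWELS)
    for vowel in VOWELS:
        for tone in TONES:
            # NFC composes vowel + tone mark into the precomposed letter.
            letters.add(unicodedata.normalize("NFC", vowel + tone))
    return letters | {c.upper() for c in letters}


ALLOWED = vietnamese_letters()


def foreign_chars(text):
    """Return the non-ASCII characters in text that are not Vietnamese letters."""
    text = unicodedata.normalize("NFC", text)
    return [c for c in text if ord(c) > 127 and c not in ALLOWED]


def suspicious_targets(lines):
    """Yield (line number, target, offending chars) for each flagged entry."""
    hits = []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:  # header or malformed line: skip (assumption)
            continue
        bad = foreign_chars(columns[3])
        if bad:
            hits.append((number, columns[3], bad))
    return hits
```

For a real file, `suspicious_targets(open("vie.unicharambigs", encoding="utf-8"))` would list the entries to review.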

http://vietunicode.sourceforge.net/charset/vietalphabet.html

Original comment by nguyen...@gmail.com on 19 Apr 2015 at 8:05

GoogleCodeExporter commented 9 years ago
Moved to github: https://github.com/tesseract-ocr/langdata/pull/9

Original comment by joregan on 13 May 2015 at 10:06