Closed by GoogleCodeExporter 9 years ago
Can you have a look at and review the Vietnamese dictionaries in the langdata repository?
https://code.google.com/p/tesseract-ocr/source/browse/vie/?repo=langdata&name=master
Original comment by zde...@gmail.com
on 16 Apr 2015 at 1:52
vie.wordlist.clean should be scrapped entirely: it contains many misspelled
Vietnamese and English words, as well as words that are missing diacritical
marks or run together (Vietnamese words are mostly monosyllabic).
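As a rough illustration of how run-together entries could be spotted, here is a minimal Python sketch. It is only a heuristic, not how the list was actually reviewed: the function name flag_run_together and the seven-letter threshold are assumptions (the threshold is based on long syllables such as "nghiêng"), and short run-ons as well as misspellings will slip through.

```python
import unicodedata


def flag_run_together(words, max_syllable_len=7):
    """Return entries containing a token longer than one plausible syllable.

    max_syllable_len is an assumed threshold, not a linguistic rule.
    """
    flagged = []
    for word in words:
        # Normalize so that each accented letter counts as one character.
        word = unicodedata.normalize("NFC", word.strip())
        if not word:
            continue
        # Multi-syllable entries are expected to be space-separated.
        if any(len(token) > max_syllable_len for token in word.split()):
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    sample = ["nghiêng", "việt nam", "tiếngviệt"]
    for entry in flag_run_together(sample):
        print(entry)
```

Longer English words in the list would be flagged by the same test, so the output is a list of candidates for manual review rather than a verdict.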
The provided vie.words_list is composed of several lists commonly used among
Vietnamese-language application developers, including those from
http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html.
The fourth column in vie.unicharambigs contains many characters that are not
Vietnamese, e.g., üûñËÄ. Those characters should not be used as match
targets.
http://vietunicode.sourceforge.net/charset/vietalphabet.html
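The fourth-column check described above could be automated along these lines. This is a minimal Python sketch under stated assumptions: it assumes vie.unicharambigs is tab-separated with the replacement in the fourth field, it treats ASCII characters as acceptable, and the helper names vietnamese_letters and flag_non_vietnamese_targets are made up for illustration.

```python
import unicodedata

# Vietnamese alphabet, generated instead of typed out by hand.
BASE_VOWELS = "aăâeêioôơuưy"
CONSONANTS = "bcdđghklmnpqrstvx"
# No tone, grave, acute, tilde, hook above, dot below.
TONE_MARKS = ["", "\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]


def vietnamese_letters():
    """Return the set of precomposed Vietnamese letters, both cases."""
    letters = set(CONSONANTS)
    for vowel in unicodedata.normalize("NFC", BASE_VOWELS):
        for tone in TONE_MARKS:
            letters.add(unicodedata.normalize("NFC", vowel + tone))
    return letters | {ch.upper() for ch in letters}


def flag_non_vietnamese_targets(lines, allowed=None):
    """Return (line number, fourth field, offending characters) tuples.

    Assumes tab-separated fields; lines with fewer than four fields
    (such as a version header) are skipped.
    """
    if allowed is None:
        allowed = vietnamese_letters()
    flagged = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        target = unicodedata.normalize("NFC", fields[3])
        # Only non-ASCII letters are checked against the alphabet.
        bad = [ch for ch in target
               if ord(ch) > 127 and ch.isalpha() and ch not in allowed]
        if bad:
            flagged.append((lineno, fields[3], bad))
    return flagged


if __name__ == "__main__":
    sample = ["v1\n", "1\tu\t1\t\u00fc\t1\n", "1\te\t1\t\u1ebf\t1\n"]
    for lineno, target, bad in flag_non_vietnamese_targets(sample):
        print(lineno, target, "".join(bad))
```

An explicit letter set is used rather than a list of permitted diacritics because characters such as ñ and û reuse marks that Vietnamese also uses (tilde, circumflex) on letters where Vietnamese never places them.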
Original comment by nguyen...@gmail.com
on 19 Apr 2015 at 8:05
Moved to github: https://github.com/tesseract-ocr/langdata/pull/9
Original comment by joregan
on 13 May 2015 at 10:06
Original issue reported on code.google.com by
nguyen...@gmail.com
on 12 Dec 2014 at 3:04