tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Vietnamese #66

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

Forwarding some feedback on the Vietnamese traineddata for 4.00.00:

The Vietnamese language data for Tesseract 4.00 seems to have better accuracy, but it still sometimes confuses the acute and hook-above marks when they sit on top of a circumflex (stacked diacritics).
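To make the confusion concrete, here is a minimal sketch using Python's standard `unicodedata` module. It decomposes two Vietnamese letters that differ only in the tone mark stacked above the circumflex. The helper name `combining_marks` is made up for illustration and is not part of Tesseract or langdata.

```python
import unicodedata

def combining_marks(ch):
    """List the names of the combining marks in the NFD form of one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

# U+1EA5 is "a" with circumflex and acute; U+1EA9 is "a" with circumflex and hook above.
for ch in ("\u1ea5", "\u1ea9"):
    print(ch, combining_marks(ch))
```

Both letters share the base "a" and the circumflex; only the small mark stacked on top differs, which is the pair the feedback says gets mixed up.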

Shreeshrii commented 7 years ago

While testing some seven-segment display images, I noted that vie gives better results than eng.

nguyenq commented 7 years ago

With the 4.00alpha vie language pack, many non-Vietnamese characters appear in the output text, such as: öïäåů€†čµñÎīšçðßęě

theraysmith commented 7 years ago

Thanks! I will put them in forbidden_characters.
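For anyone who wants to check their own OCR output for stray characters like the ones reported above, a rough sketch in Python follows. It is not how Tesseract or the forbidden_characters file works; it only flags non-ASCII characters that do not decompose into a Vietnamese vowel plus the usual marks. The names `is_expected` and `unexpected_chars` are hypothetical, and the check is deliberately strict, so non-ASCII punctuation such as curly quotes is flagged too.

```python
import unicodedata

# Tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}
# Vowel-quality marks and the base vowels they may attach to:
# circumflex on a/e/o, breve on a, horn on o/u.
QUALITY_MARKS = {"\u0302": "aeo", "\u0306": "a", "\u031b": "ou"}
VOWELS = "aeiouy"

def is_expected(ch):
    """Return True if a single character is ASCII or a Vietnamese letter."""
    if ch.isascii() or ch in "\u0111\u0110":  # d with stroke, lower and upper
        return True
    base, *marks = unicodedata.normalize("NFD", ch)
    base = base.lower()
    if base not in VOWELS or not marks:
        return False
    tones = qualities = 0
    for mark in marks:
        if mark in TONE_MARKS:
            tones += 1
        elif base in QUALITY_MARKS.get(mark, ""):
            qualities += 1
        else:
            return False
    # At most one tone mark and one vowel-quality mark per letter.
    return tones <= 1 and qualities <= 1

def unexpected_chars(text):
    """Return the characters of text (in NFC form) that fail is_expected."""
    return [ch for ch in unicodedata.normalize("NFC", text) if not is_expected(ch)]

print(unexpected_chars("Ti\u1ebfng Vi\u1ec7t \u00f6\u00f1\u20ac"))
```

Running `unexpected_chars` over recognized text gives a quick list of candidates to report.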