tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Maldivian - Dhivehi - Thaana #52

Closed. Shreeshrii closed this issue 7 years ago.

Shreeshrii commented 7 years ago

https://en.wikipedia.org/wiki/Maldivian_writing_systems

https://www.ethnologue.com/language/div

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Thaa

http://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Qa61

Ref: https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L31

Shreeshrii commented 7 years ago

Fonts

https://dhivehi.mv/fonts/

http://www.wazu.jp/gallery/Fonts_Thaana.html

http://www.fontspace.com/category/thaana

Webtext

Maldivian - Thaana script - http://crubadan.org/languages/dv

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-290235084

@theraysmith commented 2 days ago:

After a lot of work, and a very long delay, the new training is almost ready to go. Just waiting for rendering to finish...

Fixes in this round: Utilizes a new crawl of the www for ~60 languages that had the least training data, and ~15 new languages that we didn't have before. This provides much more training data, with better estimates of what is in the character set and better wordlists.

nashrafeeg commented 2 years ago

#59 (comment)

@theraysmith commented 2 days ago:

After a lot of work, and a very long delay, the new training is almost ready to go. Just waiting for rendering to finish...

Fixes in this round: Utilizes a new crawl of the www for ~60 languages that had the least training data, and ~15 new languages that we didn't have before. This provides much more training data, with better estimates of what is in the character set and better wordlists.

Hi @Shreeshrii, is Dhivehi now fully supported?