tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Add support for Shan language (shn) #33

ronaldaug closed this 2 years ago

ronaldaug commented 4 years ago

Could someone help me add the Shan language to Tesseract?

Shan language = https://en.wikipedia.org/wiki/Shan_language
Language code = shn
Shan Wiki = https://shn.wikipedia.org
All Shan words (including IPA) = jsonfile
Websites that are using Shan scripts = https://shannews.org/, http://shanunicode.com/
Font = https://saosu-mp.github.io/font/PangLong/PangLong.ttf
Shan syllable break = https://github.com/kwarm/syllable-break

Some Shan characters such as င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။ are similar to Myanmar (Burmese).
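The overlap with Burmese can be checked mechanically. The following is only a rough sketch, not part of any Tesseract tooling: it lists the code point and Unicode name of each character quoted above and reports whether it falls inside the Unicode Myanmar block (U+1000 to U+109F), which also holds the Shan-specific letters and tone marks. The helper names `in_myanmar_block` and `describe` are made up for this illustration.

```python
import unicodedata

# Unicode "Myanmar" block (U+1000..U+109F); Shan letters and tone marks
# are encoded in this same block alongside the Burmese ones.
MYANMAR_BLOCK = range(0x1000, 0x10A0)

# Characters quoted in the comment above (whitespace is skipped below).
SHARED = "င သ တ ထ ပ မ ယ ရ လ ဝ ႉ း ွ ု ူ ိ ီ ် ၊ ။"


def in_myanmar_block(ch: str) -> bool:
    """Return True if the single character lies in the Myanmar block."""
    return ord(ch) in MYANMAR_BLOCK


def describe(text: str):
    """Return (code point, Unicode name, in-block flag) per non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        rows.append(
            (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), in_myanmar_block(ch))
        )
    return rows


if __name__ == "__main__":
    for code_point, name, inside in describe(SHARED):
        print(code_point, name, inside)
```

Characters whose names mention SHAN are specific to Shan, while the rest are shared with Burmese, which is one reason an existing mya setup is a plausible starting point for shn.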

Thanks in advance

ronaldaug commented 4 years ago

It seems this repo isn't active or maintained.

stweil commented 4 years ago

That's correct, this repo is for the old Tesseract 3.05 and the legacy OCR recognizer. The more recent repository is https://github.com/tesseract-ocr/langdata_lstm. Should I move this issue to that repo?

ronaldaug commented 4 years ago

Yes please, thanks @stweil.

stweil commented 4 years ago

@ronaldaug, do you want to prepare a pull request which adds shn, maybe based on https://github.com/tesseract-ocr/langdata_lstm/tree/master/mya?

ronaldaug commented 4 years ago

Ok, I'll prepare and send a pull request to "tesseract-ocr/langdata_lstm" based on mya and other languages.

ronaldaug commented 2 years ago

@stweil Sorry for bothering you. Is this repo still active? I've created a PR for the Shan language. Do I have to train it myself?

stweil commented 2 years ago

Yes, the repo is active. I also noticed your pull request, but have not had time to review it yet. Ideally, Shan support and training should be done by someone who knows the language (so not by me).

ronaldaug commented 2 years ago

Thanks for your quick response. Though I'm not very familiar with the tesseract-ocr training process, I'll try it.