tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Adding a new language, Denjongke (Sikkimese Bhutia), to the Tesseract language dataset #52

Closed: bloodgroup-cplusplus closed this issue 8 months ago

bloodgroup-cplusplus commented 8 months ago

We are currently trying to add the Sikkimese Bhutia language to the Tesseract OCR engine. Its letters and words are similar to those of Dzongkha (dzo), which is already present in the current dataset. However, there are additional letters and words that are not included in the Dzongkha dataset. How can we contribute?
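One way to make the "additional letters" concrete is to list the characters that occur in Sikkimese text but not in the existing Dzongkha data. The following is only a minimal sketch: the function names are hypothetical, and it compares two plain strings, so reading the actual text files is left to the caller.

```python
import unicodedata


def extra_characters(new_text: str, reference_text: str) -> list[str]:
    """Return the sorted non-whitespace characters found in new_text
    that never occur in reference_text."""
    reference = set(reference_text)
    extras = {ch for ch in new_text if ch not in reference and not ch.isspace()}
    return sorted(extras)


def describe(chars: list[str]) -> list[str]:
    """Format each character as a code point plus its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in chars]
```

In practice `new_text` would be the contents of a Sikkimese corpus file and `reference_text` the contents of the existing Dzongkha text files, both read as UTF-8. The resulting list could then be attached to a pull request to show which characters the new data adds.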

stweil commented 8 months ago

Please note that new models and language support should use langdata_lstm rather than langdata. I therefore transferred this issue to that repository.

You can contribute a new language or script by sending a pull request for langdata_lstm. Use an existing language as a template for your contribution.
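As a rough illustration of using an existing language as a template, the sketch below copies a template language directory to a new one and renames the files. It assumes a layout in which each language has its own directory whose files are prefixed with the language code (for example `dzo/dzo.wordlist`). The function name and the `sip` code used in the usage note are hypothetical choices, not something confirmed in this thread.

```python
from pathlib import Path
import shutil


def scaffold_language(repo_root: str, template_code: str, new_code: str) -> list[Path]:
    """Copy <repo_root>/<template_code>/<template_code>.* to
    <repo_root>/<new_code>/<new_code>.* and return the new paths."""
    src = Path(repo_root) / template_code
    dst = Path(repo_root) / new_code
    if not src.is_dir():
        raise FileNotFoundError(f"template directory not found: {src}")
    if dst.exists():
        raise FileExistsError(f"target already exists: {dst}")
    dst.mkdir()
    created = []
    prefix = template_code + "."
    for item in sorted(src.iterdir()):
        # Only files named after the template language are copied.
        if item.is_file() and item.name.startswith(prefix):
            target = dst / (new_code + item.name[len(template_code):])
            shutil.copyfile(item, target)
            created.append(target)
    return created
```

A call such as `scaffold_language("langdata_lstm", "dzo", "sip")` would only produce a starting skeleton. The copied files still contain Dzongkha data and have to be replaced with Sikkimese text before a pull request is opened.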

bloodgroup-cplusplus commented 8 months ago

Thanks a lot, @stweil. Will do as suggested.

amitdo commented 8 months ago

If you want to have a new model, you have to train it yourself. Contributing the dataset does not mean that someone else will do the work for you.