tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Language Request: Kurdish Sorani (Central Kurdish) #296

Open makwanbarzan opened 2 years ago

makwanbarzan commented 2 years ago

There's already a trained data file for the Latin-script dialect of Kurdish. Sorani is the second most widely used dialect of the language, and it would be great to have a trained data file for it in Tesseract.

The script is Persian-like, apart from a few different letters such as ژ، گ، ڤ، چ، ۆ, so it shouldn't take much effort to develop.

Thank you and I'm looking forward to getting a response.

stweil commented 2 years ago

All those characters are included in the script/Arabic model. Maybe that already works for Sorani text?
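
One way to check that suggestion is to compare the letters from the request against a model's character set. The snippet below is a minimal sketch, not part of Tesseract: `describe`, `missing_from`, and `load_unicharset` are hypothetical helper names. The loader assumes a plain-text unicharset file (for example one unpacked from a traineddata file) whose first line is an entry count and whose remaining lines each begin with the character followed by a space.

```python
import unicodedata

# The five letters listed in the request, written as escapes to avoid
# right-to-left rendering problems in editors.
SORANI_LETTERS = [
    "\u0698",  # ARABIC LETTER JEH
    "\u06AF",  # ARABIC LETTER GAF
    "\u06A4",  # ARABIC LETTER VEH
    "\u0686",  # ARABIC LETTER TCHEH
    "\u06C6",  # ARABIC LETTER OE
]


def describe(chars):
    """Return a (code point, Unicode name) pair for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]


def missing_from(model_chars, required=SORANI_LETTERS):
    """Return the required letters that are absent from model_chars."""
    available = set(model_chars)
    return [c for c in required if c not in available]


def load_unicharset(path):
    """Read the leading token of each entry line of a unicharset file.

    Assumption: line 1 is the entry count, and every later line starts
    with the character (or a placeholder such as NULL) followed by a space.
    """
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()[1:]
    return {line.split(" ", 1)[0] for line in lines if line}


if __name__ == "__main__":
    for code_point, name in describe(SORANI_LETTERS):
        print(code_point, name)
```

If `missing_from(load_unicharset(path))` returns an empty list for the Arabic script model's unicharset, every letter from the request is at least present in that model. That says nothing about recognition accuracy on Sorani text, which still has to be tested on real images.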