tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Use LSTM Engine for hin, nep, mar, san #76

Closed: Shreeshrii closed this issue 6 years ago

Shreeshrii commented 7 years ago

In 4.00.00alpha, Devanagari-script languages are more accurate with the LSTM engine alone than in combined mode. Modify the config files to set tessedit_ocr_engine_mode to 1 (LSTM only) as the default instead of 2 (legacy + LSTM combined).
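Until the config default changes, the same engine choice can be made per run with the `--oem` command-line option of Tesseract 4. Below is a minimal Python sketch of that idea, not a definitive implementation. The helper name `build_tesseract_cmd`, the `DEVANAGARI_LANGS` set, and the file names `page.png` / `page` are hypothetical, introduced only for illustration. It only builds the command line and does not run Tesseract.

```python
# Sketch: choose the LSTM-only engine for the Devanagari-script languages
# named in this issue, and leave every other language on the default engine.
# All names below are illustrative assumptions, not part of Tesseract itself.

OEM_LSTM_ONLY = 1   # --oem 1: LSTM engine only
OEM_DEFAULT = 3     # --oem 3: default, based on what is available

# Languages listed in the issue title.
DEVANAGARI_LANGS = {"hin", "nep", "mar", "san"}


def build_tesseract_cmd(image_path, output_base, lang, oem=None):
    """Return the argument list for a `tesseract` CLI call.

    If `oem` is not given, LSTM-only mode is used for the Devanagari
    languages above and the default engine mode for everything else.
    """
    if oem is None:
        oem = OEM_LSTM_ONLY if lang in DEVANAGARI_LANGS else OEM_DEFAULT
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


if __name__ == "__main__":
    # Hypothetical input image and output base name.
    print(" ".join(build_tesseract_cmd("page.png", "page", "hin")))
```

The returned list can be passed to `subprocess.run` on a machine where Tesseract 4 and the relevant traineddata are installed.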

Shreeshrii commented 7 years ago

This applies only to 4.0 (not to 3.05).

aidinkrmz commented 6 years ago

Hello, I'm a software engineering student and I use the Tesseract OCR engine in a university project. For Persian, the traineddata file produced by training Tesseract 4.00 with the LSTM method gives good results on Arial fonts, but it does not give good results on some specific Persian fonts. So my questions are:

1. Did you use specific fonts such as B Nazanin or B Roya when training Tesseract 4.00 with LSTM?
2. If they were not used, how can we use these fonts to get better results?

I prepared a text in which all forms of the letters are repeated 10, 15, or more times. I also ran the Tesseract 3.05 training process on this text, but the output did not improve. To get good results for Persian with the Tesseract OCR engine, we need your experience and your help.

Thanks for your attention.
Sincerely.

Shreeshrii commented 6 years ago

@aidinkrmz Please see https://github.com/tesseract-ocr/tessdata/issues/70 and post your reply there.

Did you test with the latest BEST (tessdata_best) Farsi traineddata?

Shreeshrii commented 6 years ago

Closing this, as Ray will be updating langdata soon with new files.