tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Add support for different types of languages and finetune options #66

Closed Shreeshrii closed 5 years ago

Shreeshrii commented 5 years ago

Examples

For Tamil, add a new font style (Impact)

 make clean MODEL_NAME=tam

 make training MODEL_NAME=tam START_MODEL=tam LANG_TYPE=Indic FINETUNE_TYPE=Impact

For Arabic, add new characters (Plus)

 make clean MODEL_NAME=ara

 make training MODEL_NAME=ara START_MODEL=ara LANG_TYPE=RTL FINETUNE_TYPE=Plus

For English, train from scratch

 make clean MODEL_NAME=eng

 make training MODEL_NAME=eng

Shreeshrii commented 5 years ago

The WordStr option creates the box files with tesseract, then replaces the OCRed text in them with the ground truth using sed and paste. There may be a better way to handle this.
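
A rough sketch of that replacement step, to make the idea concrete. This is not the Makefile rule from the PR. The function name `replace_wordstr_text` is hypothetical, and it assumes one text line per image, so the box file holds a single `WordStr <left> <bottom> <right> <top> <page> #<text>` line followed by one tab-prefixed end-of-line marker line. It also assumes the box file was already written by tesseract (for example with its `wordstrbox` config).

```shell
#!/bin/sh
# Sketch only: swap the OCRed text of a WordStr box file for the ground truth.
# Assumption: the box file was produced beforehand, e.g. with something like
#   tesseract line.tif line --psm 6 wordstrbox
# and describes exactly one text line.

replace_wordstr_text() {
    box_file=$1   # box file written by tesseract
    gt_file=$2    # one-line ground-truth transcription

    # Keep the WordStr line only up to and including the '#',
    # which drops the OCRed text that followed it.
    sed -n 's/^\(WordStr [^#]*#\).*$/\1/p' "$box_file" > "$box_file.coords"

    # Join coordinates and ground truth with an empty delimiter.
    paste -d '\0' "$box_file.coords" "$gt_file" > "$box_file.new"

    # Carry over the remaining lines (the end-of-line marker) unchanged.
    grep -v '^WordStr ' "$box_file" >> "$box_file.new" || true

    mv "$box_file.new" "$box_file"
    rm -f "$box_file.coords"
}
```

Calling `replace_wordstr_text line.box line.gt.txt` rewrites `line.box` in place. Images that contain several text lines would need the ground truth split per line first, which this sketch does not attempt.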

Shreeshrii commented 5 years ago

@kba @wrznr Thank you both for your feedback. I will make the requested changes.

I have also been testing the makefile with 'script/xxx' traineddata as the base model and have a few more changes to make. I will post an update once those are done too.

Shreeshrii commented 5 years ago

New PR posted as https://github.com/tesseract-ocr/tesstrain/pull/87