tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Difference between START_MODEL and PROTO_MODEL #194

Closed rambalachandran closed 3 years ago

rambalachandran commented 3 years ago

Can someone please explain the difference between START_MODEL and PROTO_MODEL? I can't work it out from the README, which only states the following:

START_MODEL        Name of the model to continue from. Default: ''
PROTO_MODEL        Name of the proto model. Default: 'data/foo/foo.traineddata'

I need to fine-tune the English (eng) model with additional data. Which option should I use, and how?

Reading through the wiki, I saw the following command:

make -r training START_MODEL=Fraktur TESSDATA=/usr/local/share/tessdata/tessdata_best/script MAX_ITERATIONS=5000000 MODEL_NAME=Fraktur_5000000 RATIO_TRAIN=0.99

So, to start from the base English model, can I create a data/eng folder and place eng.traineddata in it, create a new folder eng_mod-ground-truth and place all the images and annotations there, and then use the following command?

make training START_MODEL=eng MODEL_NAME=eng_mod

Also, I don't see a tessdata_best folder after installing Tesseract and Leptonica with the make leptonica tesseract command.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wrznr commented 3 years ago

@rambalachandran Sorry for answering late! If you want to fine-tune the English model eng, set it as the start model; you do not have to touch the proto model. Also, please do not use or create a folder data/eng. Only add data/eng-mod-ground-truth if you want to name your model eng-mod (and put your training data there). All other necessary directories are created automatically!
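
The setup described above could be sketched roughly as follows. This is only an illustration: it assumes it is run from the root of a tesstrain checkout, and the TESSDATA path is a placeholder for whichever directory holds a separately downloaded eng.traineddata. The script prints the training command instead of running it, so the command can be checked first.

```shell
#!/bin/sh
# Sketch only: assumes the current directory is the root of a tesstrain checkout.
MODEL_NAME=eng-mod          # name of the new, fine-tuned model (as in the thread)
START_MODEL=eng             # existing model to continue from
TESSDATA=./tessdata_best    # ASSUMPTION: placeholder for the directory holding eng.traineddata

# Only the ground-truth directory is created by hand; data/eng is NOT created.
GT_DIR="data/${MODEL_NAME}-ground-truth"
mkdir -p "$GT_DIR"

# Copy the line images and their transcriptions into "$GT_DIR" at this point.

# PROTO_MODEL is deliberately left at its default.
TRAIN_CMD="make training MODEL_NAME=${MODEL_NAME} START_MODEL=${START_MODEL} TESSDATA=${TESSDATA}"

# Printed rather than executed, so it can be reviewed before starting a long run.
echo "$TRAIN_CMD"
```

Once the ground truth is in place, the printed command can be pasted into the shell to start training.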

rambalachandran commented 3 years ago

@wrznr Yes, that worked. Thank you!