tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

number of MAX_ITERATIONS #311

Closed whisere closed 1 year ago

whisere commented 1 year ago

Is this still the case? https://groups.google.com/g/tesseract-ocr/c/AnMYS98VwiE/m/1PN3mF6PAgAJ Does MAX_ITERATIONS depend on the number of lstmf files? If I have 1 million pairs of images and ground-truth text for training from scratch and I want to cover all of them, should I set MAX_ITERATIONS to 1 million? Thanks.

stweil commented 1 year ago

Typically you would set MAX_ITERATIONS to a multiple of the number of lines used for training.

whisere commented 1 year ago

Thanks! So is the multiple the number of epochs, i.e. max_iterations = epochs * total number of text lines? Are there any suggestions for a good multiple (number of epochs) when training from scratch without overtraining? Thank you!
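To make the arithmetic in this question concrete, here is a minimal sketch. It assumes each training iteration consumes one text line, which is what the reply above implies. The function name `max_iterations_for` and the epoch values are hypothetical illustrations, not tesstrain names or recommended settings.

```python
def max_iterations_for(num_training_lines: int, epochs: int) -> int:
    """Return epochs * num_training_lines (assumes one line per iteration)."""
    if num_training_lines <= 0 or epochs <= 0:
        raise ValueError("both arguments must be positive")
    return epochs * num_training_lines


if __name__ == "__main__":
    # 1 million lines, as in the question; the epoch counts are arbitrary examples.
    lines = 1_000_000
    for epochs in (1, 3, 5):
        value = max_iterations_for(lines, epochs)
        print(f"epochs={epochs}: MAX_ITERATIONS={value}")
```

The line count should be the number of lines actually used for training, so if part of the ground truth is held out for evaluation, count only the training share.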

whisere commented 1 year ago

If TARGET_ERROR_RATE can't be reached after training for a long time, is it right to kill the training process and run the following?

    lstmtraining \
      --stop_training \
      --continue_from data/eeboecco/checkpoints/eeboecco_checkpoint \
      --traineddata data/eeboecco/eeboecco.traineddata \
      --model_output data/eeboecco.traineddata &

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.