tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

finetuning questions #190

Closed agagsgroove closed 3 years ago

agagsgroove commented 4 years ago

Dear all,

We are interested in using Tesseract to recognize document IDs, e.g. passport IDs, serial numbers, or credit card numbers.

We installed Tesseract 4.1.0 and tried to use user patterns to constrain the Tesseract predictions. For example, our file eng.userPatternA contains the single line \A\A\d\d\d, and we set the parameter "user_patterns_suffix" to "userPatternA".
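As an illustration of the setup described above, here is a minimal sketch that writes the pattern file and assembles a matching command line. The helper names, the image path, and the choice of `--psm 7` (single text line) are assumptions made for this example and are not taken from the original post. It also assumes that Tesseract looks for `<lang>.<suffix>` inside the tessdata directory.

```python
from pathlib import Path


def write_user_patterns(tessdata_dir, lang="eng", suffix="userPatternA",
                        patterns=(r"\A\A\d\d\d",)):
    """Write <lang>.<suffix> into tessdata_dir, one pattern per line.

    Hypothetical helper: the file name follows the eng.userPatternA
    convention mentioned in the post.
    """
    path = Path(tessdata_dir) / f"{lang}.{suffix}"
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path


def build_tesseract_command(image, lang="eng", suffix="userPatternA", psm=7):
    """Return the argument list for a tesseract call with user patterns enabled.

    psm=7 is an assumption for cropped single-line images.
    """
    return [
        "tesseract", str(image), "stdout",
        "-l", lang,
        "--psm", str(psm),
        "-c", f"user_patterns_suffix={suffix}",
    ]
```

The returned list can be passed to `subprocess.run`. Building it separately makes it easy to run the same image once with and once without the `-c` option and compare the outputs.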

However, we cannot see any difference between the Tesseract predictions with user patterns switched on or off. Is the user patterns parameter supported in Tesseract 4.1.0, or are we missing something here?
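One way to quantify whether the setting has any effect is to check the predictions against the intended format after recognition. The sketch below assumes that, in Tesseract's pattern syntax, \A stands for an upper-case letter and \d for a digit, and it approximates both with ASCII classes. The function names are hypothetical.

```python
import re

# ASCII approximation of the user pattern \A\A\d\d\d
# (assumed meaning: two upper-case letters followed by three digits).
ID_PATTERN = re.compile(r"[A-Z]{2}[0-9]{3}")


def matches_id_pattern(text):
    """Return True if the whole (stripped) prediction fits the ID format."""
    return ID_PATTERN.fullmatch(text.strip()) is not None


def conformity_rate(predictions):
    """Fraction of predictions that fit the ID format (0.0 for an empty list)."""
    if not predictions:
        return 0.0
    return sum(matches_id_pattern(p) for p in predictions) / len(predictions)
```

Comparing `conformity_rate` for runs with and without the user patterns parameter gives a simple number to report, and the same check can serve as a post-filter that rejects malformed IDs.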

As a next step, we plan to fine-tune Bengali and Arabic using float models as a base, e.g. from the tessdata_best repository. Do you have any suggestions on how to choose the best parameters for fine-tuning? We found parameters such as segment_non_alphabetic_script, but we don't know whether we should use them.

In general, we provide binarised, cropped images (image height 36 pixels) of the IDs or serial numbers as input for fine-tuning as well as for recognition. Are there further image processing techniques that improve the fine-tuning or the predictions? Moreover, are there additional steps, such as deactivating dictionaries, that improve training or accuracy?

Cheers,

Jong

wrznr commented 4 years ago

:wave:

Your first question is related to Tesseract itself and should be posted in the Tesseract user forum. Concerning your second question, I think that @Shreeshrii has already run quite a number of trainings on Arabic. Maybe he can share some of his insights in the Wiki, which would be the best way to answer such a general question. In general, we would be happy if you simply start and get back to us with your experiences and/or problems. Last but not least, dictionaries are always deactivated unless you provide them (even if the model you fine-tune from has dictionaries).
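For readers who want to "simply start", a fine-tuning run with tesstrain can be sketched as follows. The Makefile variables MODEL_NAME, START_MODEL, TESSDATA, and MAX_ITERATIONS come from the tesstrain README. The model name, the tessdata path, and the iteration count are placeholders chosen for this example, not recommendations from the thread.

```python
def build_finetune_command(model_name, start_model, tessdata_dir,
                           max_iterations=None):
    """Return the argument list for a tesstrain fine-tuning run.

    START_MODEL names the base traineddata (e.g. "ben" or "ara"), and
    TESSDATA points at the directory holding it, for example a checkout
    of tessdata_best. All concrete values are supplied by the caller.
    """
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
    ]
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


# Hypothetical values for a Bengali document-ID model.
command = build_finetune_command("docid_ben", "ben", "/path/to/tessdata_best",
                                 max_iterations=5000)
```

By default tesstrain expects the line images and their transcriptions in `data/<MODEL_NAME>-ground-truth`, so the cropped ID images would go there before the command is run from the tesstrain checkout.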

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.