tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

training finishes with low error but results on new images are not good #143

Closed Mohamed209 closed 4 years ago

Mohamed209 commented 4 years ago

I am trying to fine-tune with new fonts extracted from receipts. Training finishes with an error rate of around 0.4 (see the attached data samples). After training, I want the new weights to generalize so that a new input receipt is parsed with the lowest possible error rate, but when I run Tesseract with the new weights the results are much worse. I am trying to predict on mixed-language receipts (English plus Arabic): I fine-tune new Arabic weights and combine them with the original English weights (eng+new_arabic), but the results are not good on pure English, pure Arabic, or mixed-language receipts. I am posting this as a question, not an issue: what might be the potential reasons? Attached: training_data.tar.gz. Please also check my Makefile:

Makefile.txt
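
For readers following along, the "eng+new_arabic" combination described above is selected at recognition time with Tesseract's `-l` option. The sketch below only builds the command line; it is an illustration, not the poster's actual invocation. The model name `new_arabic` comes from the post, while the image path, output base, and tessdata directory are hypothetical placeholders, and `--psm 6` is an assumed page segmentation mode.

```python
import shlex


def build_tesseract_command(image, output_base, languages,
                            tessdata_dir=None, psm=None):
    """Build a tesseract CLI call that loads several models at once.

    `languages` is an ordered list of traineddata names; they are joined
    with '+' for the -l option (for example eng+new_arabic).
    """
    cmd = ["tesseract", image, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Directory that holds the fine-tuned .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    if psm is not None:
        # Page segmentation mode; the right value depends on the receipts.
        cmd += ["--psm", str(psm)]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; only the model names come from the thread.
    for order in (["eng", "new_arabic"], ["new_arabic", "eng"]):
        cmd = build_tesseract_command("receipt.png", "receipt_out", order,
                                      tessdata_dir="./tessdata", psm=6)
        print(shlex.join(cmd))
```

The loop prints the command for both language orders, since comparing `eng+new_arabic` against `new_arabic+eng` on the same receipts is a cheap experiment. The returned list can be passed to `subprocess.run` once the paths point at real files.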

Mohamed209 commented 4 years ago

@Shreeshrii I know you contribute to Arabic fonts and languages.

Shreeshrii commented 4 years ago

@Mohamed209 Yes, I have done a few runs of finetune training for Arabic. But I don't know the language and haven't used any of the traineddatas for actual recognition.

There are existing issues regarding the use of multiple languages; see https://github.com/tesseract-ocr/tesseract/search?q=multi+language&type=Issues

Shreeshrii commented 4 years ago

@Mohamed209 Did you reverse the groundtruth to LTR in the box files?
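
To make the question concrete: one naive reading of "reverse to LTR" is to flip a purely right-to-left line so that its characters appear in left-to-right visual order. The sketch below does only that and is an assumption about what is meant, not the tesstrain implementation. It keeps combining marks (such as Arabic diacritics) attached to their base letter, and it is not a bidirectional algorithm: on lines that contain digits or Latin words it would reverse those too.

```python
import unicodedata


def reverse_rtl_line(text: str) -> str:
    """Reverse a purely RTL line into left-to-right visual order.

    Characters are grouped into simple clusters (a base character plus
    any combining marks that follow it) so that diacritics stay attached
    to the right letter after the reversal.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach the mark to the previous base.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


if __name__ == "__main__":
    line = "\u0628\u064e\u0627\u0628"  # beh + fatha, alef, beh
    flipped = reverse_rtl_line(line)
    print([f"U+{ord(c):04X}" for c in flipped])
```

Receipts usually contain prices and dates, so a line-level flip like this is likely too crude for real ground truth; it is shown only to illustrate what the reversal does to a simple line.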

Mohamed209 commented 4 years ago

@Shreeshrii So if you have fine-tuned Arabic weights before, how did you make sure that the new weights provide the best fit and generalize to new input images outside the training and testing sets?

Mohamed209 commented 4 years ago

Also, I have another question: could I fine-tune on mixed-language images to get a new weights file that performs well on mixed-language documents? If so, what should the annotation files (box files) look like? For example, in "ancsjd مبتته بيسسن hdkodg" the line itself mixes Arabic and English.
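
Before deciding how to annotate such lines, it can help to know how many ground-truth lines are actually mixed. The following is a minimal sketch that labels a line as Arabic, Latin, or mixed from its Unicode code points. The function names and the choice of Unicode blocks are illustrative assumptions, and the sketch does not answer how the box files should be laid out.

```python
# Unicode blocks treated as "Arabic script" here (an illustrative choice).
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def is_latin(ch: str) -> bool:
    # Only ASCII letters are counted; accented Latin letters are ignored.
    return ch.isascii() and ch.isalpha()


def classify_line(text: str) -> str:
    """Return 'mixed', 'arabic', 'latin', or 'other' for one text line."""
    has_arabic = any(is_arabic(ch) for ch in text)
    has_latin = any(is_latin(ch) for ch in text)
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"


if __name__ == "__main__":
    print(classify_line("ancsjd مبتته بيسسن hdkodg"))
```

Sorting the ground truth this way also makes it possible to report results separately for pure English, pure Arabic, and mixed lines.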

Mohamed209 commented 4 years ago

> @Mohamed209 Did you reverse the groundtruth to LTR in the box files?

Yes, I handled the RTL conditions.

Shreeshrii commented 4 years ago

script/Arabic was trained on English and on all other languages that use the Arabic script. Ray Smith has not documented the process for multi-language training.

You will have to experiment with different options and see what works.

You can see shreeshrii/tesstrain-arabic-gs and shreeshrii/tesstrain-ckb for my fine-tuning attempts. The lstmeval results, as well as the ocreval results using the ISRI tools, are better than those of the official traineddata. But I have not tested with any real-life images.
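
Since the gap in this thread is between a low training error and poor results on real images, one way to measure it is to compute a character error rate (CER) on held-out receipt lines that were never used for training. The sketch below is a generic edit-distance CER, not the metric that lstmeval or ocreval report. It assumes you already have pairs of ground-truth text and OCR output as strings; the sample pairs are made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion from ref
                cur[j - 1] + 1,          # insertion into ref
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(pairs) -> float:
    """CER over (ground_truth, ocr_output) pairs: total edits / total chars."""
    pairs = list(pairs)
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0


if __name__ == "__main__":
    # Made-up held-out lines: (ground truth, OCR output).
    held_out = [("TOTAL 12.50", "TOTAL 12.80"), ("CASH", "CA5H")]
    print(f"held-out CER: {character_error_rate(held_out):.3f}")
```

Running the same held-out set through the official traineddata and the fine-tuned one gives a like-for-like comparison that does not depend on the training or evaluation lists used during fine-tuning.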

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.