Closed Mohamed209 closed 4 years ago
@Shreeshrii I know you have been contributing to Arabic fonts and languages.
@Mohamed209 Yes, I have done a few fine-tuning runs for Arabic. But I don't know the language and haven't used any of the traineddata files for actual recognition.
These are existing issues regarding use of multiple languages - see https://github.com/tesseract-ocr/tesseract/search?q=multi+language&type=Issues
@Shreeshrii So if you have fine-tuned Arabic weights before, how did you make sure that the new weights fit well and generalize to new input images outside the training and test sets?
I also have another question: could I fine-tune on mixed-language images to get a new weights file that performs well on mixed-language documents? If so, what would the annotation files (box files) look like? For example, in "ancsjd مبتته بيسسن hdkodg" the line itself mixes Arabic and English.
@Mohamed209 Did you reverse the groundtruth to LTR in the box files?
Yes, I handled the RTL conditions.
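For anyone checking mixed-direction ground truth like the example line above, here is a minimal sketch that labels each whitespace-separated token as LTR or RTL using the Unicode bidirectional class from Python's standard `unicodedata` module. It only flags which lines are mixed so they can be inspected by hand. It does not reorder text and says nothing about how tesstrain expects box files to be laid out. The helper names (`token_direction`, `line_directions`, `is_mixed`) are made up for illustration.

```python
import unicodedata

# Bidi classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def token_direction(token):
    """Return 'rtl' if the token contains any strongly RTL character, else 'ltr'."""
    for ch in token:
        if unicodedata.bidirectional(ch) in RTL_CLASSES:
            return "rtl"
    return "ltr"


def line_directions(line):
    """Label every whitespace-separated token of a ground-truth line."""
    return [token_direction(tok) for tok in line.split()]


def is_mixed(line):
    """True if the line contains both LTR and RTL tokens."""
    return len(set(line_directions(line))) > 1


sample = "ancsjd مبتته بيسسن hdkodg"
print(line_directions(sample))
print(is_mixed(sample))
```

Running this over every ground-truth line gives a quick count of how many training lines are actually mixed, which may help when deciding whether the training set resembles the receipts.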
script/Arabic was trained on English and all other languages using Arabic script. Ray Smith has not documented the process for multi-language training.
You will have to experiment with different options and see what works.
You can see shreeshrii/tesstrain-arabic-gs and shreeshrii/tesstrain-ckb for my fine-tuning attempts. Both the lstmeval results and the ocreval results (using the ISRI tools) are better than those of the official traineddata, but I have not tested with any real-life images.
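On the generalization question, one simple check is to score the OCR output against ground truth for images that were never used in training. The sketch below computes a character error rate from the Levenshtein edit distance. It is a simplified stand-in and not the exact metric reported by lstmeval or the ISRI tools. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Toy strings only; real use would compare OCR output to held-out ground truth.
print(char_error_rate("kitten", "sitting"))
```

Comparing this number for the official traineddata and the fine-tuned one on the same held-out images shows whether fine-tuning helped or overfit.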
I am trying to fine-tune with new fonts extracted from receipts. Training finishes with an error rate around 0.4 (see the attached data samples). After training, I want the new weights to generalize so that a new input receipt is parsed with the lowest possible error rate, but when I run tesseract with the new weights the results are much worse. I am trying to predict on mixed-language receipts (English plus Arabic): I fine-tune new Arabic weights and combine them with the original English weights (eng+new_arabic), but the results are not good on pure English, pure Arabic, or mixed-language receipts. I am posting this as a question, not an issue: what might be the potential reasons? Attached: training_data.tar.gz. Please also check my Makefile:
Makefile.txt
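To isolate where the quality drops, it may help to run the same receipt through each model on its own and then through the combination. The sketch below only builds the command line. It assumes the `tesseract` CLI with its `-l LANG[+LANG]` and `--tessdata-dir` options, and a fine-tuned model saved as `new_arabic.traineddata` (name taken from the post). The image path, output base, and tessdata directory are hypothetical.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages, tessdata_dir=None):
    """Build a tesseract CLI invocation; languages are joined with '+'."""
    cmd = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Directory holding eng.traineddata and new_arabic.traineddata (assumed layout).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical file names, for illustration only.
for langs in (["eng"], ["new_arabic"], ["eng", "new_arabic"], ["new_arabic", "eng"]):
    cmd = build_tesseract_command("receipt.png", "receipt_out", langs,
                                  tessdata_dir="./tessdata")
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. If the fine-tuned model alone is already worse than the official Arabic model on unseen receipts, the problem is more likely in the training data or settings than in the language combination.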