tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

RTL language training issue #301

Closed: sameearif88 closed this issue 2 years ago

sameearif88 commented 2 years ago

Issue: Characters change on reversing the text

Groundtruth: عَلیٰ عَبْدِہٖ الْکِتَاب

Reversed groundtruth: ۔ باَتِکْلا ٖہِدْبَع ٰیلَع

For RTL language training we have to reverse the groundtruth, but reversing changes the characters. عَ becomes ع after the string is reversed, so a unicharset file built from the .box file won't include عَ, which is the original character. Won't this affect the accuracy of the OCR? Is there a way to fix this? Can we flip the image instead of reversing the groundtruth?
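To illustrate what seems to be happening: a plain code-point reversal moves each combining mark (for example the fatha, U+064E) in front of its base letter, so it ends up attached to a different letter. Below is a minimal sketch of a reversal that keeps each mark with its base letter, assuming the groundtruth really does need to be reversed. The helper names `split_clusters` and `reverse_keep_marks` are made up for this example and are not part of tesstrain, and grouping by `unicodedata.combining` is only an approximation of full grapheme-cluster segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for marks such as Arabic diacritics.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_keep_marks(text):
    """Reverse the order of clusters while leaving each cluster intact."""
    return "".join(reversed(split_clusters(text)))


if __name__ == "__main__":
    sample = "\u0639\u064e\u0644"  # ain + fatha, then lam
    naive = sample[::-1]  # the fatha now follows lam instead of ain
    safe = reverse_keep_marks(sample)  # the fatha still follows ain
    print([hex(ord(c)) for c in naive])
    print([hex(ord(c)) for c in safe])
```

With this kind of reversal, ain plus fatha stays one unit in the reversed line, so a unicharset derived from it should still see the original combination. Whether that matches what the training pipeline expects is something I have not verified.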