tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

ground-truth for RTL (Urdu) #64

Closed mshakirDr closed 5 years ago

mshakirDr commented 5 years ago

Hi, it has been clarified elsewhere that box files for RTL languages should be generated the same way as for LTR languages. The input data format for OCR-D is line images with corresponding text strings. The example data provided in the README is straightforward for LTR scripts. However, is there a difference for RTL languages? Should the text string in .gt.txt be reversed? We are trying to train for Urdu, but the final error rate is 90% or above for 444 line pairs (sample attached). We suspect that the text direction is the cause. If that is indeed the case, should the text files be reversed at the character level?

output.zip

wrznr commented 5 years ago

No experience with RTL languages yet, sorry. Why not try reversing the strings and report back here with the result?
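For anyone who wants to run that experiment, here is a minimal Python sketch that writes character-reversed copies of the `.gt.txt` files. It is only an illustration: the function names and the directory arguments are hypothetical, and nothing in this thread establishes that reversal actually improves training. It keeps combining marks (for example Arabic-script diacritics) attached to their base character, since a plain code-point reversal would move them onto the wrong letter.

```python
import unicodedata
from pathlib import Path


def reverse_line(text: str) -> str:
    """Reverse a line by base character, keeping combining marks on their base."""
    clusters = []
    for ch in text:
        # A nonzero combining class marks a diacritic that belongs to the previous base.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_ground_truth(src_dir: str, dst_dir: str) -> int:
    """Write reversed copies of every *.gt.txt in src_dir into dst_dir (hypothetical paths)."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for gt in sorted(src.glob("*.gt.txt")):
        # Each ground-truth file is assumed to hold a single line of text.
        line = gt.read_text(encoding="utf-8").rstrip("\n")
        (dst / gt.name).write_text(reverse_line(line) + "\n", encoding="utf-8")
        count += 1
    return count
```

Writing to a separate directory leaves the original ground truth untouched, so both variants can be trained and compared. The line images would still need to be copied alongside the reversed text files.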

mshakirDr commented 5 years ago

Thanks for the reply. We concluded that Nastaleeq spacing (diagonal spacing) is something the OCR cannot handle, at least for now. More details are in the following tesseract issue: https://github.com/tesseract-ocr/tesseract/issues/2407

wrznr commented 5 years ago

Thanks for the information!