Closed by mshakirDr 5 years ago
No experience with RTL languages yet, sorry. Why not try reversing the strings and report back with the result here?
Thanks for the reply. Our assessment is that Nastaleeq spacing (diagonal spacing) is not something the OCR can solve, at least for now. More details are in the following tesseract issue: https://github.com/tesseract-ocr/tesseract/issues/2407
Thanks for the information!
Hi, it has been clarified elsewhere that box files for RTL languages should be generated the same way as for LTR languages. The input data format for OCR-D is line images with corresponding text strings. The example data provided in the readme is straightforward for LTR scripts. However, is there a difference for RTL languages? Should the text string in .gt.txt be reversed? We are trying to train for Urdu, but the final error rate is 90% or above for 444 line pairs (sample attached). We suspect that the text direction is the cause. If that is indeed the case, should the text files be reversed at character level?
output.zip
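For anyone who wants to run the reversal experiment discussed in this thread, here is a minimal sketch in Python. It is not part of the training tooling; the function names `reverse_line` and `reverse_ground_truth` and the directory arguments are hypothetical. It assumes one line of UTF-8 text per .gt.txt file, and it keeps combining marks (such as Urdu diacritics) attached to their base letter, which a plain `text[::-1]` would not do.

```python
import unicodedata
from pathlib import Path


def reverse_line(text: str) -> str:
    """Reverse text character by character, keeping combining marks
    (e.g. Arabic-script diacritics) directly after their base character."""
    clusters = []
    for ch in text:
        # A non-zero combining class means the character attaches to the
        # previous base character, so keep the two together.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_ground_truth(src_dir: str, dst_dir: str) -> int:
    """Write a reversed copy of every *.gt.txt in src_dir into dst_dir.

    The originals are left untouched. Returns the number of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.gt.txt")):
        line = path.read_text(encoding="utf-8").rstrip("\n")
        (dst / path.name).write_text(reverse_line(line) + "\n", encoding="utf-8")
        count += 1
    return count
```

The reversed copies go into a separate directory, so the matching line images would have to be copied there as well before training. Given the clarification that RTL data should be prepared like LTR data, this is only useful as a comparison run, not as a recommended preprocessing step.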