Open MaidaButtar opened 1 year ago
Examples taken from: https://opendata.uni-halle.de/handle/1981185920/88120
Unfortunately, the problem originates from the data itself, which already contains the letters in reversed order. The workflow problem is related to https://github.com/ulb-sachsen-anhalt/ocrd-odem/issues/14, but hopefully this is gone by now.
Still, we must face the PDF text layer.
The word-level representation in the online viewers is out of scope of this tool. Since the current OCR run using ODEM is going to produce properly ordered characters, this will be fixed as soon as possible.
@MaidaButtar Can you please take a look into the PDF files again?
It seems to me that the characters rendered in the outline used to navigate between sections / chapters (usually displayed on the left side of a PDF viewer, such as the one in the Firefox browser) are properly ordered.
I checked the PDF files again, and both known problems now occur together: not only is the order of the letters within each word inverted, but so is the order of the words. In other words, the first word is at the end of the line.
And that is correct: the outline of sections and chapters on the left of the PDF viewer is ordered properly.
@MaidaButtar Can you please try these cases and report their results:
And exactly which PDF-reader tool are you using?
@M3ssman I tested both Adobe Acrobat Reader and the PDF viewer in the Firefox browser.
- nothing is found, as if there were no match
example: the word نام is searched for and displayed reversed. Thus, مان is displayed in the full text.
To give an update:
Therefore I'm afraid this issue is tied to the overall update of PDF generation (Next Version PDF Processing).
The following problems occur with left-to-right and right-to-left text direction in the full-text display of the IIIF Viewer and the DFG Viewer, and in the PDF files, for Persian texts:
IIIF Viewer: The order of the words is correct, but the letters in the words are reversed. The order of digits is correct (reason: numbers are read from left to right).
DFG Viewer: Line breaks are all gone, the order of words is only partly correct, and again the letters in the words are reversed. The order of digits is correct.
PDF: Order of words reversed at line level, but letters in the words are not reversed. Order of digits correct.
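To make the three cases above easier to tell apart when checking a line by hand, here is a minimal Python sketch. It is not part of ODEM or of any viewer; the function names (`letters_reversed`, `words_reversed`, `classify`) are hypothetical and only illustrate the difference between reversing letters inside words and reversing word order at line level. It assumes words are separated by whitespace and that digit runs keep their order, as observed above.

```python
def reverse_letters(word: str) -> str:
    """Reverse the characters of one word; pure digit runs stay untouched."""
    return word if word.isdigit() else word[::-1]


def letters_reversed(line: str) -> str:
    """Word order kept, letters inside each word reversed (the IIIF Viewer case)."""
    return " ".join(reverse_letters(w) for w in line.split())


def words_reversed(line: str) -> str:
    """Letters kept, word order reversed at line level (the PDF case)."""
    return " ".join(reversed(line.split()))


def classify(expected: str, extracted: str) -> str:
    """Compare the expected logical-order line with what a viewer or PDF returns."""
    if extracted == expected:
        return "ok"
    if extracted == letters_reversed(expected):
        return "letters reversed"
    if extracted == words_reversed(expected):
        return "words reversed"
    if extracted == words_reversed(letters_reversed(expected)):
        return "letters and words reversed"
    return "other"


if __name__ == "__main__":
    # ASCII stand-ins keep the example readable regardless of terminal bidi support.
    print(classify("abc 123 def", "cba 123 fed"))
    print(classify("abc 123 def", "def 123 abc"))
```

Copying one line from the full-text display or the PDF text layer and passing it to `classify` together with the correct reading shows which of the reported cases applies. Palindromic or single-word lines are ambiguous under this simple comparison.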