ulb-sachsen-anhalt / digital-derivans

Derive new digitals from existing ones
MIT License

Right-to-left and left-to-right orientation of Persian digital copies in the full-text display of the IIIF Viewer, the DFG Viewer and PDF #54

Open MaidaButtar opened 1 year ago

MaidaButtar commented 1 year ago

The following problems occur with the recognition and display of left-to-right and right-to-left orientation in the Persian full text shown in the IIIF Viewer, the DFG Viewer and the PDF files:

IIIF Viewer: The word order is correct, but the letters within each word are reversed. The digit order is correct (reason: numbers are read from left to right).

DFG Viewer: All line breaks are lost and the word order is only partly correct; again, the letters within each word are reversed. The digit order is correct.

PDF: The word order is reversed at line level, but the letters within each word are not reversed. The digit order is correct.

[Screenshots, not preserved: DFG Viewer, IIIF Viewer, PDF]
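The remark that digits keep their order while letters do not can be illustrated with the Unicode bidirectional classes of the characters involved. The following is a minimal sketch, not taken from Derivans or any of the viewers: the helper name `bidi_classes` and the sample string are assumptions made for illustration. It relies only on Python's standard `unicodedata` module.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair every character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Hypothetical sample: the three letters noon, alef, meem followed by a number.
sample = "\u0646\u0627\u0645 1234"

for ch, cls in bidi_classes(sample):
    # Arabic-script letters report "AL" (strong right-to-left),
    # European digits report "EN" (a number class laid out left to right).
    print(f"U+{ord(ch):04X} {cls}")
```

Because letters and digits fall into different classes, a renderer or converter that mishandles right-to-left runs can scramble the letters while leaving digit sequences untouched, which would be consistent with the symptoms listed above.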

MaidaButtar commented 1 year ago

Examples taken from: https://opendata.uni-halle.de/handle/1981185920/88120

M3ssman commented 1 year ago

Unfortunately, the problem originates from the data itself, which already contains the letters in reversed order. The underlying workflow problem is related to https://github.com/ulb-sachsen-anhalt/ocrd-odem/issues/14, and hopefully it is gone by now.
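To make "letters already in reversed order" concrete, here is a minimal sketch of how such data could be normalized after the fact. This is an assumption for illustration only: it is not how ODEM or Derivans handle the problem, and the function name `reverse_rtl_runs` is hypothetical. It reverses each maximal run of strong right-to-left characters and leaves digits, spaces and everything else in place.

```python
import unicodedata

# Strong right-to-left bidirectional classes (Hebrew-type and Arabic-type letters).
RTL_CLASSES = {"R", "AL"}


def reverse_rtl_runs(text: str) -> str:
    """Reverse each maximal run of strong RTL characters, keep the rest as-is."""
    out: list[str] = []
    run: list[str] = []
    for ch in text:
        if unicodedata.bidirectional(ch) in RTL_CLASSES:
            run.append(ch)  # still inside an RTL run
        else:
            out.extend(reversed(run))  # flush the finished run, reversed
            run.clear()
            out.append(ch)  # digits, spaces, punctuation stay untouched
    out.extend(reversed(run))  # flush a run that ends the string
    return "".join(out)


# Stored (reversed) letters meem, alef, noon become noon, alef, meem.
fixed = reverse_rtl_runs("\u0645\u0627\u0646 12")
```

The sketch deliberately ignores combining marks and the zero-width non-joiner (U+200C), which is common in Persian; both would split a word into separate runs here, so real data would need more careful handling.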

Still, we have to deal with the PDF text layer.

M3ssman commented 1 year ago

The word-level representation in the online viewers is out of scope for this tool. Since the current OCR run using ODEM is going to produce properly ordered characters, this will be fixed as soon as possible.

@MaidaButtar Can you please take a look into the PDF files again?

It seems to me that the characters rendered in the outline used to navigate between sections and chapters (usually displayed on the left side of a PDF viewer, such as the one in Firefox) are properly ordered.

MaidaButtar commented 1 year ago

I checked the PDF files again, and both known cases now occur together: not only is the order of the letters within a word inverted, but so is the order of the words. In other words, the first word is at the end of the line.

And you are right: the outline of sections and chapters on the left side of the PDF viewer is ordered properly.

M3ssman commented 1 year ago

@MaidaButtar Can you please try these cases and report their results:

And exactly which PDF-reader tool are you using?

MaidaButtar commented 1 year ago

@M3ssman I tested with both Adobe Acrobat Reader and the PDF viewer in the Firefox browser.

  1. When you search for a word that is displayed correctly in the navigation, there are two cases:
    • The search matches the word in the wrong order. The searched word is not found as displayed in the headings, but the reversed variant is found and marked in the text. However, the marked word is not the reversed word itself but some other word. If you look closely, you can see that the reversed word is on the same line.
    • Nothing is found, as if there were no match.

  2. In this case, the words are displayed reversed. Again, the marked word is not the exact word but some other word; if you look closely, you can see that the word is on the same line, only reversed.

MaidaButtar commented 1 year ago

Example: the word نام is searched for and displayed reversed. Thus, مان is displayed in the full text.
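The search behaviour described here can be turned into a quick diagnostic for an extracted text layer. The following is a minimal sketch under stated assumptions: `match_order` is a hypothetical helper, and the text layer is assumed to be available as a plain Python string, for example after extraction with whatever PDF tool is at hand. It checks whether a search term occurs as typed or only with its characters reversed.

```python
from typing import Optional


def match_order(text_layer: str, term: str) -> Optional[str]:
    """Report how a search term occurs in an extracted text layer.

    Returns "logical" if the term occurs as typed, "reversed" if only its
    character-reversed form occurs, and None if neither is present.
    """
    if term in text_layer:
        return "logical"
    if term[::-1] in text_layer:
        return "reversed"
    return None


# The searched word from the example: noon, alef, meem.
term = "\u0646\u0627\u0645"
# Hypothetical text layer that stores the word as meem, alef, noon.
layer = "\u0645\u0627\u0646 1234"

result = match_order(layer, term)
```

A "reversed" result for words that the outline shows correctly would suggest that the text layer stores its letters in reversed order. Terms that read the same in both directions always report "logical", so they are not useful for this check.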

M3ssman commented 11 months ago

To give an update:

Therefore I'm afraid this issue is tied to the overall update of PDF generation (Next Version PDF Processing).