Closed calixteman closed 3 years ago
Looking at the PDF file contents using https://brendandahl.github.io/pdf.js.utils/browser/, we should most likely use the RC field of the annotation (see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=400&zoom=auto,-215,558) since it contains rtl spans. I think that would make this a duplicate of #2966, given that it involves rich text content?
I checked the contents of the RC field and they aren't exactly what we have in the rendered annotation.
In Acrobat, they're used for rendering in the annotation's text editor (I changed some colors in the XML and there was no impact on the rendered annotation, only in the editor).
Ah, okay. I wonder what Acrobat uses then; the only related property I could find otherwise is the language property in one of the root nodes. The annotation itself doesn't seem to contain anything other than the RC property that indicates that it should be RTL, unless I missed something of course.
Just to clarify the issue as I understand it: Hebrew is written right to left (RTL), but pdf.js lays the text out left to right (LTR). Here are some screenshots.
I'd say that they're using classical algorithms to guess which parts of the text are LTR and which are RTL. So we could try to see if we can use https://github.com/mozilla/pdf.js/blob/master/src/core/bidi.js, which could help to fix this issue since almost all text is LTR. For a text 50% in English and 50% in Hebrew we would have to detect the different parts and either insert some LTR/RTL markers or generate some spans with a dir attribute.
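To make the "spans with a dir attribute" option concrete, here is a rough sketch. It is not pdf.js code and not the full Unicode bidirectional algorithm: `strongDirection`, `splitIntoRuns`, `escapeHtml` and `toSpans` are hypothetical helpers, the RTL character ranges are a simplification, and neutral characters (spaces, digits, punctuation) are simply kept with the run that precedes them.

```javascript
// Hypothetical sketch, not part of pdf.js and not a full bidi implementation.
// Rough set of RTL code points (Hebrew, Arabic, related blocks and
// presentation forms). This is an approximation chosen for illustration.
const RTL_RANGE = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/u;
const LETTER = /\p{L}/u;

// Returns "rtl" or "ltr" for letters, and null for neutral characters.
function strongDirection(ch) {
  if (RTL_RANGE.test(ch)) {
    return "rtl";
  }
  if (LETTER.test(ch)) {
    return "ltr";
  }
  return null;
}

// Splits text into consecutive runs that share one direction.
// Neutral characters stay in the current run; a leading neutral
// character starts an "ltr" run by default.
function splitIntoRuns(text) {
  const runs = [];
  for (const ch of text) {
    const dir = strongDirection(ch);
    const last = runs[runs.length - 1];
    if (last && (dir === null || dir === last.dir)) {
      last.text += ch;
    } else {
      runs.push({ dir: dir || "ltr", text: ch });
    }
  }
  return runs;
}

function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wraps every run in a span that carries its own dir attribute.
function toSpans(text) {
  return splitIntoRuns(text)
    .map(run => `<span dir="${run.dir}">${escapeHtml(run.text)}</span>`)
    .join("");
}
```

The alternative mentioned above, inserting LTR/RTL markers, would use the same run detection but emit Unicode control characters such as U+200E and U+200F instead of spans.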
> For a text 50% in English and 50% in Hebrew we would have to detect the different parts and either insert some LTR/RTL markers or generate some spans with a dir attribute.
How common would such a case actually be in practice though? My guess is that it'd probably be quite rare. Also, how does e.g. Adobe Reader handle the /Contents being an equal mix of LTR and RTL locales?
Basically, I can't help wondering whether we need to "complicate" the (initial) implementation all that much, or whether we could simply use the dir property that the bidi function returns as-is and not worry about any edge cases?
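As an illustration of the simpler option, a single direction for the whole string, here is a minimal sketch. It does not call the bidi function from pdf.js; `guessDirection` is a hypothetical stand-in that just counts strong RTL letters against other letters, and the character ranges are an approximation.

```javascript
// Hypothetical helper, not part of pdf.js: guess one base direction for a
// whole string by counting RTL letters versus all other letters.
const RTL_CHARS = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/u;
const LTR_CHARS = /\p{L}/u;

function guessDirection(text) {
  let rtl = 0;
  let ltr = 0;
  for (const ch of text) {
    if (RTL_CHARS.test(ch)) {
      rtl++;
    } else if (LTR_CHARS.test(ch)) {
      ltr++;
    }
    // Digits, spaces and punctuation are neutral and not counted.
  }
  // Ties and strings without letters fall back to "ltr".
  return rtl > ltr ? "rtl" : "ltr";
}
```

The result could then be assigned to the dir attribute of the element that holds the annotation text, without splitting the text into separate runs. An exact 50/50 mix falls back to "ltr" here, which is precisely the edge case the question above proposes not to worry about initially.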
I fully agree with you: there is no need to over-complicate things, because my guesses are the same as yours.
So, the bidi approach seems to work nicely for this case :-)
However, I've not yet run tests so I suppose that there may be some fallout that needs to be fixed; I'll assign this to myself and continue working on it during the weekend.
Attach (recommended) or Link to PDF file here: mytestfile.pdf
Configuration:
Steps to reproduce the problem:
What is the expected behavior? (add screenshot) The text layout should be RTL.
What went wrong? (add screenshot) The text layout is LTR.
It works correctly in evince and in Acrobat. I added dir="rtl" to the p element using devtools, and the rendering is then the same as in Acrobat.