mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Text annotation in hebrew is rendered in LTR mode #14046

Closed calixteman closed 3 years ago

calixteman commented 3 years ago

Attach (recommended) or Link to PDF file here: mytestfile.pdf

Configuration:

Steps to reproduce the problem:

  1. Open the pdf
  2. Click on the yellow area

What is the expected behavior? (add screenshot) The text layout should be RTL.

What went wrong? (add screenshot) The text layout is LTR.

It works correctly in Evince and in Acrobat. I added dir="rtl" on the p element using devtools, and the rendering is then the same as in Acrobat.

timvandermeij commented 3 years ago

Looking at the PDF file contents using https://brendandahl.github.io/pdf.js.utils/browser/, we should most likely use the RC field of the annotation (see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=400&zoom=auto,-215,558), since it contains RTL spans. Given that this involves rich text content, I think that would make this a duplicate of #2966?

calixteman commented 3 years ago

I checked the contents of the RC field and they aren't exactly what we have in the rendered annotation. In Acrobat, they're used for the rendering in the annotation's text editor (I changed some colors in the XML and there was no impact on the rendered annotation, only on the editor).

timvandermeij commented 3 years ago

Ah, okay. I wonder what Acrobat uses then; the only related property I could find otherwise is the language property in one of the root nodes. The annotation itself doesn't seem to contain anything other than the RC property that indicates that it should be RTL, unless I missed something of course.

raheemalzeeshan commented 3 years ago

Just to clarify the issue as I understand it: Hebrew is written right to left (RTL), but pdf.js lays the text out left to right (LTR). Here are some screenshots:

[Screenshots: Hebrew text rendered LTR; Hebrew text rendered RTL]

calixteman commented 3 years ago

I'd say that they're using classical algorithms to guess which parts of the text are LTR and which are RTL. So we could see whether we can use https://github.com/mozilla/pdf.js/blob/master/src/core/bidi.js, which could help to fix this issue since almost all text is LTR. For a text 50% in english and 50% in hebrew we would have to detect the different parts and either insert some LTR/RTL markers or generate some spans with a dir attribute.
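
To make the "detect the different parts and generate spans with a dir attribute" idea concrete, here is a rough sketch. It is not the code in bidi.js and not a full Unicode Bidirectional Algorithm: `splitIntoRuns`, `runsToSpans`, and the character ranges are hypothetical names and simplifications chosen for illustration only.

```javascript
// Hypothetical helpers, NOT part of pdf.js.
// Strong RTL characters: Hebrew and Arabic blocks only (a simplification).
const STRONG_RTL = /[\u0590-\u05FF\u0600-\u06FF]/;
// Strong LTR characters: basic and extended Latin letters only.
const STRONG_LTR = /[A-Za-z\u00C0-\u024F]/;

function charDirection(ch) {
  if (STRONG_RTL.test(ch)) return "rtl";
  if (STRONG_LTR.test(ch)) return "ltr";
  return null; // neutral: digits, spaces, punctuation
}

// Split text into consecutive runs that share a direction.
// Neutral characters are appended to the run that precedes them.
function splitIntoRuns(text) {
  const runs = [];
  for (const ch of text) {
    const dir = charDirection(ch);
    const last = runs[runs.length - 1];
    if (last && (dir === null || dir === last.dir)) {
      last.str += ch;
    } else {
      // A leading neutral character starts an "ltr" run by default.
      runs.push({ str: ch, dir: dir ?? "ltr" });
    }
  }
  return runs;
}

function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap each run in a span that carries its own dir attribute.
function runsToSpans(runs) {
  return runs
    .map(r => `<span dir="${r.dir}">${escapeHtml(r.str)}</span>`)
    .join("");
}
```

For example, `splitIntoRuns("Hello \u05E9\u05DC\u05D5\u05DD")` yields one LTR run for the Latin part (including the trailing space) and one RTL run for the Hebrew part. A real implementation would need proper handling of neutrals, digits, and nesting.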

Snuffleupagus commented 3 years ago

For a text 50% in english and 50% in hebrew we would have to detect the different parts and either insert some LTR/RTL markers or generate some spans with a dir attribute.

How common would such a case actually be in practice though? My guess is that it'd probably be quite rare. Also, how does e.g. Adobe Reader handle the /Contents being an equal mix of LTR and RTL locales?

Basically I cannot help but wonder if we need to "complicate" the (initial) implementation all that much, or if we could simply use the dir-property that the bidi-function returns as-is and not worry about any edge-cases?
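
A minimal sketch of that simpler option: pick one direction for the whole text and apply it as-is. It assumes the analysis result has the shape `{ str, dir }`. The `guessDirection` function below is only a stand-in that counts strong characters, not pdf.js's bidi function, and `applyDirection` is a hypothetical helper.

```javascript
// Stand-in for a bidi-style analysis (NOT pdf.js's bidi.js):
// picks the dominant direction by counting strong characters.
const RTL_LETTER = /[\u0590-\u05FF\u0600-\u06FF]/; // Hebrew, Arabic
const LTR_LETTER = /[A-Za-z]/;                     // basic Latin only

function guessDirection(text) {
  let rtl = 0;
  let ltr = 0;
  for (const ch of text) {
    if (RTL_LETTER.test(ch)) rtl++;
    else if (LTR_LETTER.test(ch)) ltr++;
  }
  // Ties and text without strong characters fall back to "ltr".
  return { str: text, dir: rtl > ltr ? "rtl" : "ltr" };
}

// Apply the result to an element-like object (an HTMLElement in a browser).
// No per-run handling: the whole paragraph gets a single direction.
function applyDirection(element, result) {
  element.textContent = result.str;
  element.dir = result.dir;
  return element;
}
```

In a browser this would be called with the popup's p element, which mirrors the manual dir="rtl" experiment from the original report. The trade-off is that evenly mixed text gets a single direction, which is the edge case being set aside here.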

calixteman commented 3 years ago

I fully agree with you: there is no need to over-complicate things, because my guesses are the same as yours.

Snuffleupagus commented 3 years ago

So, the bidi approach seems to work nicely for this case :-) However, I've not yet run tests so I suppose that there may be some fallout that needs to be fixed; I'll assign this to myself and continue working on it during the weekend.