ietf-tools / xml2rfc

Generate RFCs and IETF drafts from document source in XML according to the IETF xml2rfc v2 and v3 vocabularies
https://ietf-tools.github.io/xml2rfc/
BSD 3-Clause "New" or "Revised" License

unexpected characters in RTL string in PDF output #903

Closed alicerusso closed 2 years ago

alicerusso commented 2 years ago

Describe the issue

This is regarding the 4-character string of Hebrew that appears 3 times in Appendix A.3 of the PDF below.

There seem to be some odd characters in the xml2rfc PDF output for this document: a) copy and paste yields extraneous characters (for example, the string pastes into various applications as “שלוםԬԳשל”, which includes U+052C and U+0533), and b) it triggers an error message when used as input to pdfaPilot ("Text cannot be mapped to Unicode").

output of xml2rfc:
https://www.rfc-editor.org/authors/rfc9290before.pdf
https://www.rfc-editor.org/authors/rfc9290.txt
https://www.rfc-editor.org/authors/rfc9290.html

source: https://www.rfc-editor.org/authors/rfc9290.xml

I seem to have narrowed the error to the Hebrew strings: if I remove the 3 lines that contain the Hebrew string from the XML file, run xml2rfc to make the PDF, and then run pdfaPilot on it, the pdfaPilot "Text cannot be mapped to Unicode" error goes away.

xml2rfc 3.15.0
WeasyPrint version 56.1


cabo commented 2 years ago

Is it intentional that there is a mixture of Roboto-Mono and DejaVu-Sans-Mono in the PDF, with a fallback to DejaVu-Sans (no -Mono) for the relevant שלום strings?

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
CPQNPJ+Noto-Serif-Bold               CID TrueType      Identity-H       yes yes yes    366  0
NHZIEB+Noto-Serif                    CID TrueType      Identity-H       yes yes yes    370  0
KAVNOB+Noto-Serif-Italic             CID TrueType      Identity-H       yes yes yes    374  0
JMUSCG+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes    378  0
QYUIBE+Roboto-Mono                   CID TrueType      Identity-H       yes yes yes    382  0
KOBGSK+Noto-Serif-Hebrew             CID TrueType      Identity-H       yes yes yes    386  0
FKVZRG+DejaVu-Sans                   CID TrueType      Identity-H       yes yes yes    390  0

In the clipboard, I see this:

d7 a9
d7 9c
d7 95
d7 9d
f4 8f b0 82
f4 8f b0 81
d7 a9
d7 9c

f4 8f b0 82 is U+10FC02; f4 8f b0 81 is U+10FC01.

Both are private-use characters that I don't immediately find in the PDF.
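For anyone who wants to double-check the byte-to-code-point mapping, here is a minimal Python sketch that decodes the clipboard byte groups listed above. The names `CLIPBOARD_HEX` and `describe` are made up for this illustration; only the standard library is used.

```python
import unicodedata

# Byte groups copied from the clipboard dump above, one UTF-8 sequence per entry.
CLIPBOARD_HEX = [
    "d7 a9", "d7 9c", "d7 95", "d7 9d",
    "f4 8f b0 82", "f4 8f b0 81",
    "d7 a9", "d7 9c",
]

def describe(hex_group):
    """Decode one UTF-8 byte group and return (code point label, description)."""
    ch = bytes.fromhex(hex_group).decode("utf-8")
    label = f"U+{ord(ch):04X}"
    # Category "Co" marks private-use code points, which have no character name.
    if unicodedata.category(ch) == "Co":
        return label, "<private use>"
    return label, unicodedata.name(ch, "<unnamed>")

decoded = [describe(group) for group in CLIPBOARD_HEX]
for label, name in decoded:
    print(label, name)
```

The first four entries decode to Hebrew letters, and the two four-byte sequences land in the supplementary private-use plane, which is consistent with the observation above.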

pdftotext turns the PDF into

22 (quote)
e2 80 ab (RIGHT-TO-LEFT EMBEDDING)
22 (quote)
d7 9d 
d7 95 
d7 9c 
d7 a9 
20 
d7 9c 
d7 a9 
e2 80 ac (POP DIRECTIONAL FORMATTING)

Note the 20 (space) where the two private-use characters were, and also the occurrence of the repeated d7a9 (U+05E9) and d79c (U+05DC). (Note that pdftotext has some reordering logic that obscures this result somewhat.)

I'm not a PDF expert, so this is about as far as I will get.

cabo commented 2 years ago

(The private-use characters I see with pdfkit are different from the U+052C CYRILLIC CAPITAL LETTER DCHE and U+0533 ARMENIAN CAPITAL LETTER GIM that you see. This might be another indication of malformed PDF.)

jrlevine commented 2 years ago

I believe this is a chronic problem with our PDF renderings. Everything is supposed to be in Noto or Roboto, but other fonts that happen to be installed on the machine where they're rendered leak in. For that reason many of the early PDFs have SourceCodePro and a little TimesNewRoman. I presume there is some way to tell WeasyPrint which fonts to use and to make it stick, but it's hard, and it's the sort of thing we need to test before using in production. PS: I told you so.

alicerusso commented 2 years ago

@jrlevine FYI, in May 2020 the RPC installed a missing font package (Roboto Mono) so that PDFs use the intended fonts for monospace characters. (The xml2rfc release notes for v2.45.0 note that a warning about missing Roboto Mono fonts was added, along with installation instructions.)

As far as odd fonts "leaking in" after that point, running pdffonts on published PDFs would show how often this happens. I'm not sure why DejaVu-Sans-Mono and DejaVu-Sans are in rfc9290.pdf.
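A rough sketch of how such a survey could be scripted, assuming the Poppler `pdffonts` tool is on the PATH and that only Noto and Roboto families are expected (that expectation is taken from the comment above, not from any xml2rfc configuration). The function names are hypothetical, and the parser simply takes the first whitespace-separated token of each row, so font names containing spaces would be truncated.

```python
import subprocess

# Assumption: only these font families are expected in published PDFs.
EXPECTED_PREFIXES = ("Noto", "Roboto")

def font_names(pdffonts_output):
    """Extract font names from pdffonts text output, dropping subset tags."""
    names = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashed rule
        if not line.strip():
            continue
        name = line.split()[0]
        # Subsetted fonts carry a six-letter tag, e.g. "JMUSCG+DejaVu-Sans-Mono".
        names.append(name.split("+", 1)[-1])
    return names

def leaked_fonts(pdffonts_output):
    """Return fonts whose names do not start with an expected family."""
    return [n for n in font_names(pdffonts_output)
            if not n.startswith(EXPECTED_PREFIXES)]

def check_pdf(path):
    """Run pdffonts on one PDF file and report unexpected fonts."""
    result = subprocess.run(["pdffonts", path], capture_output=True,
                            text=True, check=True)
    return leaked_fonts(result.stdout)
```

Calling `check_pdf("rfc9290.pdf")` on a local copy of each published PDF would give a per-document list of leaked fonts; applied to the pdffonts listing earlier in this thread, `leaked_fonts` flags the two DejaVu entries.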

kesara commented 2 years ago

DejaVu-Sans-Mono is leaking from the svg element which has the following attribute: font-family="monospace".
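To spot this kind of attribute in a source file before rendering, something like the following could work. It is only a sketch: `generic_font_attrs` and the `GENERIC_FAMILIES` set are invented for this example, it is not part of xml2rfc, and it flags any `font-family` value that lists a generic family, even as a fallback.

```python
import xml.etree.ElementTree as ET

# Generic CSS font families that resolve to whatever the rendering machine has.
GENERIC_FAMILIES = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}

def generic_font_attrs(xml_text):
    """Return (tag, font-family) pairs for elements that list a generic family."""
    hits = []
    for elem in ET.fromstring(xml_text).iter():
        value = elem.get("font-family")
        if value is None:
            continue
        families = {f.strip().strip("'\"").lower() for f in value.split(",")}
        if families & GENERIC_FAMILIES:
            # Strip the "{namespace}" prefix ElementTree adds to tag names.
            hits.append((elem.tag.rsplit("}", 1)[-1], value))
    return hits
```

Run against the document source, a non-empty result would point at elements whose font depends on what is installed on the rendering machine.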

kesara commented 2 years ago

I think I nailed down the issues with various font leakages, see #905 & #906.

kesara commented 2 years ago

These unexpected characters in RTL string in PDF output are actually caused by the font leakage. Closing this issue since the issue will be fixed in #905 & #906. PDF without font leakages: rfc9290.pdf