Closed alicerusso closed 2 years ago
Is it intentional that there is a mixture of Roboto-Mono and DejaVu-Sans-Mono in the PDF, with a fallback to DejaVu-Sans (no -Mono) for the relevant שלום strings?
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
CPQNPJ+Noto-Serif-Bold               CID TrueType      Identity-H       yes yes yes    366  0
NHZIEB+Noto-Serif                    CID TrueType      Identity-H       yes yes yes    370  0
KAVNOB+Noto-Serif-Italic             CID TrueType      Identity-H       yes yes yes    374  0
JMUSCG+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes    378  0
QYUIBE+Roboto-Mono                   CID TrueType      Identity-H       yes yes yes    382  0
KOBGSK+Noto-Serif-Hebrew             CID TrueType      Identity-H       yes yes yes    386  0
FKVZRG+DejaVu-Sans                   CID TrueType      Identity-H       yes yes yes    390  0
In the clipboard, I see this:
d7 a9
d7 9c
d7 95
d7 9d
f4 8f b0 82
f4 8f b0 81
d7 a9
d7 9c
f4 8f b0 82 is U+10FC02; f4 8f b0 81 is U+10FC01.
Both are private-use characters that I don't immediately find in the PDF.
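For anyone who wants to re-check the decoding, here is a minimal Python sketch that turns the clipboard bytes quoted above into code points and reports each one's Unicode general category ("Co" means private use). The function name describe is just an illustration, not part of any tool mentioned in this thread.

```python
import unicodedata

# UTF-8 bytes copied from the clipboard dump quoted above.
clipboard_hex = "d7a9 d79c d795 d79d f48fb082 f48fb081 d7a9 d79c"


def describe(hex_dump: str) -> list[tuple[str, str]]:
    """Decode a UTF-8 hex dump into (code point, general category) pairs."""
    text = bytes.fromhex(hex_dump.replace(" ", "")).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


for code_point, category in describe(clipboard_hex):
    # Hebrew letters report "Lo"; the two plane-16 characters report "Co".
    print(code_point, category)
```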
pdftotext turns the PDF into
22 (quote)
e2 80 ab (RIGHT-TO-LEFT EMBEDDING)
22 (quote)
d7 9d
d7 95
d7 9c
d7 a9
20
d7 9c
d7 a9
e2 80 ac (POP DIRECTIONAL FORMATTING)
Note the 20 (space) where the two private-use characters were, and also the occurrence of the repeated d7a9 (U+05E9) and d79c (U+05DC). (Note that pdftotext has some reordering logic that obscures this result somewhat.)
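The same kind of check can be run on the pdftotext bytes. This is only a sketch using Python's standard unicodedata module; the helper name names is made up for illustration.

```python
import unicodedata

# UTF-8 bytes from the pdftotext output listed above.
pdftotext_hex = "22 e280ab 22 d79d d795 d79c d7a9 20 d79c d7a9 e280ac"


def names(hex_dump: str) -> list[str]:
    """Decode a UTF-8 hex dump and return the Unicode name of each character."""
    text = bytes.fromhex(hex_dump.replace(" ", "")).decode("utf-8")
    return [unicodedata.name(ch) for ch in text]


for name in names(pdftotext_hex):
    print(name)
```

The listing makes it easy to see the SPACE sitting between the two Hebrew runs, where the clipboard dump had the two private-use characters.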
I'm not a PDF expert, so this is about as far as I will get.
(The private-use characters I see with pdfkit are different from the U+052C CYRILLIC CAPITAL LETTER DCHE and U+0533 ARMENIAN CAPITAL LETTER GIM that you see. This might be another indication of a malformed PDF.)
I believe this is a chronic problem with our PDF renderings. Everything is supposed to be in Noto or Roboto, but other fonts that happen to be installed on the machine where the PDFs are rendered leak in. That is why many of the early PDFs have SourceCodePro and a little TimesNewRoman. I presume there is some way to tell WeasyPrint which fonts to use and to make it stick, but it's hard and is the sort of thing we need to test before using in production. PS: I told you so.
@jrlevine FYI, in May 2020 the RPC installed a missing font package (Roboto Mono) so that PDFs use the intended fonts for monospace characters. (The xml2rfc release notes for v2.45.0 include that a warning was added about missing Roboto Mono fonts as well as installation instructions.)
As far as odd fonts "leaking in" after that point, running pdffonts on published PDFs would show how often this happens. I'm not sure why DejaVu-Sans-Mono and DejaVu-Sans are in rfc9290.pdf.
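A rough sketch of how such a survey could be scripted in Python: it parses pdffonts output and lists any font whose name does not start with Noto or Roboto. The function names and the EXPECTED_PREFIXES tuple are assumptions for illustration, and the parser assumes font names contain no spaces, as in the listing earlier in this thread.

```python
import subprocess

# Assumption: only these families are intended to appear in published PDFs.
EXPECTED_PREFIXES = ("Noto", "Roboto")


def leaked_fonts(pdffonts_output: str) -> list[str]:
    """Return font names from pdffonts output that are not in the expected families."""
    leaked = []
    # The first two lines are the column header and the dashed rule.
    for line in pdffonts_output.splitlines()[2:]:
        if not line.strip():
            continue
        name = line.split()[0]
        family = name.split("+", 1)[-1]  # drop the subset tag, e.g. "JMUSCG+"
        if not family.startswith(EXPECTED_PREFIXES):
            leaked.append(family)
    return leaked


def check_pdf(path: str) -> list[str]:
    """Run pdffonts on one PDF (requires poppler-utils or xpdf on PATH)."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return leaked_fonts(result.stdout)


# Three rows taken from the pdffonts listing quoted in this thread.
sample = """name type encoding emb sub uni object ID
---- ---- -------- --- --- --- ---------
NHZIEB+Noto-Serif CID TrueType Identity-H yes yes yes 370 0
JMUSCG+DejaVu-Sans-Mono CID TrueType Identity-H yes yes yes 378 0
FKVZRG+DejaVu-Sans CID TrueType Identity-H yes yes yes 390 0
"""
print(leaked_fonts(sample))
```

Looping check_pdf over a directory of published PDFs would give a per-file list of unexpected fonts.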
DejaVu-Sans-Mono is leaking from the svg element, which has the following attribute: font-family="monospace".
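To find this kind of attribute in a source file, something like the following Python sketch could help. It walks an XML document and reports elements whose font-family attribute is a generic CSS family, which the renderer would resolve to whatever font is installed locally. The function name, the GENERIC_FAMILIES set, and the sample markup are all illustrative assumptions, not taken from rfc9290.xml.

```python
import xml.etree.ElementTree as ET

# Generic CSS families that get resolved to a locally installed font.
GENERIC_FAMILIES = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}


def generic_font_elements(xml_text: str) -> list[tuple[str, str]]:
    """Return (tag, font-family) for elements that use a generic font family."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        family = elem.get("font-family")
        if family and family.strip().lower() in GENERIC_FAMILIES:
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any
            hits.append((tag, family))
    return hits


# Hypothetical fragment, only to exercise the function.
sample = (
    '<artwork><svg xmlns="http://www.w3.org/2000/svg" font-family="monospace">'
    '<text font-family="serif">x</text></svg></artwork>'
)
print(generic_font_elements(sample))
```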
I think I nailed down the issues with various font leakages, see #905 & #906.
The unexpected characters in the RTL strings in the PDF output are actually caused by the font leakage. Closing this issue since it will be fixed in #905 & #906. PDF without font leakages: rfc9290.pdf
Describe the issue
This is regarding the 4-character string of Hebrew that appears 3 times in Appendix A.3 of the PDF below.
Seems there are some odd chars in the PDF output of xml2rfc of this document because a) copy & paste yields extraneous characters (for example: pasted into various applications as “שלוםԬԳשל“ showing U+052C and U+0533 in there) and b) it’s giving an error message when used as input to pdfaPilot ("Text cannot be mapped to Unicode").
output of xml2rfc: https://www.rfc-editor.org/authors/rfc9290before.pdf https://www.rfc-editor.org/authors/rfc9290.txt https://www.rfc-editor.org/authors/rfc9290.html
source: https://www.rfc-editor.org/authors/rfc9290.xml
I seem to have narrowed the error down to the Hebrew strings: if I remove the 3 lines from the XML file that contain the Hebrew string, then run xml2rfc to make the PDF, then run pdfaPilot on it, the pdfaPilot "Text cannot be mapped to Unicode" error goes away.
xml2rfc 3.15.0 WeasyPrint version 56.1