ietf-tools / datatracker

The day-to-day front-end to the IETF database for people who work on IETF standards.
https://datatracker.ietf.org
BSD 3-Clause "New" or "Revised" License

pdf variant does not successfully include ⛄ (U+26C4 SNOWMAN WITHOUT SNOW) #6050

Open dkg opened 1 year ago

dkg commented 1 year ago

Describe the issue

draft-dkg-rfcediting-non-ascii-ietf-tooling is a test draft that contains multiple non-ascii characters. they all render just fine in the text and html variants, but the pdf variant fails to include ⛄ (U+26C4 SNOWMAN WITHOUT SNOW). it renders ☃ (U+2603 SNOWMAN) with no problem, though. Maybe this has something to do with codepoint coverage of the default fonts.

cabo commented 1 year ago

Font issue, I'd say

$ pbpaste | echars
*** Miscellaneous Symbols (Common)
☃: U+2603 1 SNOWMAN
⛄: U+26C4 1 SNOWMAN WITHOUT SNOW

They are in the same group in Unicode, but of course fonts don't pick up whole groups. (And my browser is broken and shows both the same.)
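For anyone without `echars`, a similar listing can be sketched with Python's standard `unicodedata` module. The helper name `describe` is made up for illustration, and the output omits the block header and the per-character count that `echars` prints.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'char: U+XXXX NAME' line per distinct character, in order."""
    seen: list[str] = []
    for ch in text:
        if ch not in seen:
            seen.append(ch)
    return [
        f"{ch}: U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in seen
    ]


# U+2603 SNOWMAN followed by U+26C4 SNOWMAN WITHOUT SNOW
for line in describe("\u2603\u26c4"):
    print(line)
```

Both code points sit in the Miscellaneous Symbols block (U+2600 to U+26FF), which matches the observation that a shared block says nothing about whether a given font covers both.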

rjsparks commented 1 year ago

Is this an issue with the pdf that comes out of xml2rfc, or the pdfized pdf-rendering-of-the-htmlized-text that comes out of the datatracker? If the former, let's move this to the xml2rfc repo?

kesara commented 1 year ago

The PDF generated by xml2rfc shows both snowmen, but from two different font groups. We probably need to include an extra font in xml2rfc, but I think we can tackle that if this gets to the RFC-to-be stage.

draft-dkg-rfcediting-non-ascii-ietf-tooling-01.pdf

dkg commented 1 year ago

in case it wasn't clear, i don't intend draft-dkg-rfcediting-non-ascii-ietf-tooling to ever become an RFC! that's just a test harness so i can push back on some of the FUD i was hearing about how non-ASCII text might be broken.

I'm unaware of any RFC use case that would need either SNOWMAN character, but the demonstration is intended to highlight problems and identify structural issues in unicode coverage and transmission before some RFC really does try to use a symbol that isn't well-supported in one of the output formats.

The problem pdf i found was generated by the datatracker -- i don't know what toolchain was used. When generating the file locally with xml2rfc i do actually see both glyphs. It's possible that this is due to my having certain fonts available locally that are not available on the VM hosting the datatracker, but i don't know.
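One way to test the "different fonts on different machines" guess is to ask each font file whether its character map covers both code points. The sketch below assumes the third-party fontTools package is installed and that fonts are `.ttf` files under some directory such as `/usr/share/fonts`; the helper names are hypothetical, and this is not how the datatracker or WeasyPrint actually selects fonts.

```python
from collections.abc import Mapping
from pathlib import Path


def missing_codepoints(text: str, cmap: Mapping[int, str]) -> list[str]:
    """List the characters of `text` that have no entry in a font's cmap."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) not in cmap]


def load_cmap(font_path: str) -> dict[int, str]:
    """Read the preferred Unicode cmap of a font file (requires fontTools)."""
    from fontTools.ttLib import TTFont  # third-party, imported lazily

    font = TTFont(font_path)
    try:
        # getBestCmap() maps code points to glyph names, or returns None.
        return dict(font.getBestCmap() or {})
    finally:
        font.close()


def fonts_covering(text: str, font_dir: str) -> list[str]:
    """Names of .ttf files under `font_dir` whose cmap covers all of `text`."""
    hits = []
    for path in sorted(Path(font_dir).rglob("*.ttf")):
        if not missing_codepoints(text, load_cmap(str(path))):
            hits.append(path.name)
    return hits
```

Running something like `fonts_covering("\u2603\u26c4", "/usr/share/fonts")` on both the local machine and the host that builds the PDF would show whether the two font sets differ in coverage. The directory path is an assumption and varies by system.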

thanks for looking into it, i really appreciate all the work that has been done on making the RFC series capable of including robust, modern documents with a stable and expansive character set.

rjsparks commented 1 year ago

Thanks @dkg - I understand what you're doing - and what you provide above is enough for me to know which invocation of weasyprint to study. It's the one in the xml2rfc environment used by the datatracker when it generates formats from xml submissions, which may well not have the right font set installed - we'll go look.

dkg commented 1 year ago

(for the record, this I-D has been much more useful than just identifying the SNOWMAN weirdness -- it demonstrated that use cases i heard active concerns about during IETF 117 (cyrillic text, mathematical symbols) do work fine. what you see in my reports are the corner cases where things remain broken -- but the real takeaway from this for me is that the use cases people actually care about are not broken. thanks for all the work that has gone into this!)

larseggert commented 1 year ago

In the web view, on my machine, the "snowman without snow" comes from the "Apple Color Emoji" font, and the "snowman with snow" comes from the "Menlo" font.

I guess that's because the CSS says font-family: "Noto Sans Mono", SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace. (But I don't understand where "Apple Color Emoji" comes from...)

That same CSS is passed into Weasyprint when making the PDF, and these are the fonts that end up in the PDF:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IVBPZU+Noto-Sans-Mono                CID TrueType      Identity-H       yes yes yes     64  0
JUCHNC+Noto-Sans-Mono-Bold           CID TrueType      Identity-H       yes yes yes     68  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes     72  0
IZYCUH+DejaVu-Sans                   CID TrueType      Identity-H       yes yes yes     76  0

Not sure where/why "DejaVu" is picked up from, but I guess it doesn't have the character.
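To make the fallback visible, the listing above (it looks like `pdffonts` output, which is an assumption) can be compared against the CSS font stack. This is a rough sketch: the function names are made up, and the name matching is a simple prefix heuristic, not the real CSS font-matching algorithm.

```python
import re

# Font families named in the CSS quoted above (generic "monospace" omitted).
CSS_STACK = ["Noto Sans Mono", "SFMono-Regular", "Menlo", "Monaco",
             "Consolas", "Liberation Mono", "Courier New"]

SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IVBPZU+Noto-Sans-Mono                CID TrueType      Identity-H       yes yes yes     64  0
JUCHNC+Noto-Sans-Mono-Bold           CID TrueType      Identity-H       yes yes yes     68  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes     72  0
IZYCUH+DejaVu-Sans                   CID TrueType      Identity-H       yes yes yes     76  0
"""


def embedded_fonts(listing: str) -> list[str]:
    """Extract font names from the listing, dropping 'ABCDEF+' subset tags."""
    names = []
    for line in listing.splitlines():
        if not line.strip() or line.startswith(("name", "---")):
            continue  # skip blank lines, the header row, and the rule
        names.append(re.sub(r"^[A-Z]{6}\+", "", line.split()[0]))
    return names


def fallback_fonts(names: list[str], stack: list[str] = CSS_STACK) -> list[str]:
    """Return embedded fonts that match no family in the CSS stack."""
    def norm(s: str) -> str:
        return s.replace("-", " ").lower()

    wanted = [norm(family) for family in stack]
    return [n for n in names
            if not any(norm(n).startswith(w) for w in wanted)]


print(fallback_fonts(embedded_fonts(SAMPLE)))
```

On the listing above this flags the two DejaVu fonts as the ones that did not come from the declared stack, which is consistent with them being picked up as a system fallback for glyphs Noto Sans Mono lacks.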

Since we want to use Noto, should we add https://fonts.google.com/noto/specimen/Noto+Emoji?