Issues with UTF-8 -formatted markdown

mmarkdown / mmark

Mmark: a powerful markdown processor in Go geared towards the IETF

https://mmark.miek.nl

Issues with UTF-8 -formatted markdown #163

Closed dwaite closed 2 years ago

dwaite commented 2 years ago

When encountering a file containing UTF-8 (e.g. non-US-ASCII bytes), it appears the characters are output to XML in a <u> element, which is not part of the xml2rfc specification.

When the file starts with a UTF-8 BOM (e.g. \xEF\xBB\xBF), the XML is generated without escaping the metadata section and with the following non-well-formed XML fragment toward the beginning:

<t><u format="char-num"><feff></u>
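One possible workaround on the input side is to drop a leading BOM before the markdown reaches the parser. The following is a minimal sketch in Go; `stripBOM` is a hypothetical helper written for illustration and is not part of mmark, and the sample document in `main` is made up.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF, the byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM is a hypothetical helper (not an mmark API). It returns data
// without a leading UTF-8 BOM; input that has no BOM is returned unchanged.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	// Made-up sample: a BOM followed by a metadata block.
	src := []byte("\xEF\xBB\xBF%%%\ntitle = \"Example\"\n%%%\n")
	clean := stripBOM(src)
	fmt.Printf("%d bytes before, %d bytes after\n", len(src), len(clean))
}
```

Running the file contents through something like this before handing them to the processor should keep the U+FEFF code point out of the generated XML, assuming the BOM is the only cause of the broken fragment.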

miekg commented 2 years ago

[ Quoting @.***> in "[mmarkdown/mmark] Issues with UTF-8..." ]

When encountering a file containing UTF-8 (e.g. non-US-ASCII bytes), it appears the characters are output to XML in a <u> element, which is not part of the xml2rfc specification.

sadly it is: https://xml2rfc.tools.ietf.org/xml2rfc-doc.html#name-u-new-2

I don't have a good answer. I've raised this on the rfc-interest list, and I think at some point UTF-8 will just be the new normal; for now we seem to need these crazy workarounds.

Note that when xml2rfc detects Unicode, it generates an HTML entity for it, e.g. the one for an en dash (–).

/Miek

-- Miek Gieben

dwaite commented 2 years ago

Oh I thought RFC7991 was the definitive document.

miekg commented 2 years ago

[ Quoting @.***> in "Re: [mmarkdown/mmark] Issues with U..." ]

Oh I thought RFC7991 was the definitive document.

yeah... if only

7991 is a guide, the current spec isn't documented, it's what xml2rfc currently implements... I fully expect a new RFC when the dust settles, but for now this is the current state.

I would like <u> to not insert all that extra text and just render the unicode code point - but then again, why not go full UTF-8. Also see: https://github.com/rfc-format/draft-iab-xml2rfc-v3-bis/issues/205

/Miek

-- Miek Gieben

dwaite commented 2 years ago

The issue I encountered in this case was explanatory text that contains smart quotes.

The xml2rfc language unfortunately seems to require the use of "num" in <u>'s format even when an ASCII alternative is given, so in this particular case the best option for me would be to add a post-commit to reject non-latin1 code points.
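A check like the one described could be sketched in Go as follows. This is only an illustration: `findNonLatin1` and the `offender` type are hypothetical names, not part of mmark or xml2rfc, and the sample text is made up. It reports every code point above U+00FF, which includes the smart quotes (U+2018 to U+201D).

```go
package main

import (
	"fmt"
	"strings"
)

// offender records one code point above U+00FF and where it was found.
// Line and Col are 1-based; Col counts code points, not bytes.
type offender struct {
	Line, Col int
	R         rune
}

// findNonLatin1 is a hypothetical helper. It returns every code point in
// text that lies outside Latin-1 (above U+00FF). Invalid UTF-8 bytes decode
// to U+FFFD when ranging over a string, so they are reported as well.
func findNonLatin1(text string) []offender {
	var out []offender
	for i, line := range strings.Split(text, "\n") {
		col := 0
		for _, r := range line {
			col++
			if r > 0xFF {
				out = append(out, offender{Line: i + 1, Col: col, R: r})
			}
		}
	}
	return out
}

func main() {
	// Made-up sample with smart quotes on the second line.
	text := "plain ASCII line\nhe said \u201chello\u201d\n"
	for _, o := range findNonLatin1(text) {
		fmt.Printf("line %d, col %d: U+%04X\n", o.Line, o.Col, o.R)
	}
}
```

A commit hook built around this would presumably exit with a non-zero status whenever the returned slice is non-empty, so the offending characters get fixed in the markdown source rather than ending up in <u> elements.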

That said, I don't see any great options in terms of markdown tooling for unicode - the xml2rfc language seems to take all non-us7ascii text as normatively important, and automatically adding U+xxxx syntax to individual characters within a keyword or the like is sub-optimal. Seems like one needs a grouping construct for text data not already overridden for some specific formatting.

miekg commented 2 years ago

[ Quoting @.***> in "Re: [mmarkdown/mmark] Issues with U..." ]

The issue I encountered in this case was explanatory text that contains smart quotes.

I think everything is pointing in the direction of "just allow UTF-8 everywhere".

We'll have to wait until we get there though.