metanorma / metanorma-ietf

Metanorma processor for IETF documents
BSD 2-Clause "Simplified" License

Constrain appearances of <u> #163

Closed (opoudjis closed 3 years ago)

opoudjis commented 3 years ago

From https://github.com/metanorma/metanorma-ietf/issues/161, we are injecting <u> around non-ASCII Unicode characters. That is triggering a syntax error in an instance of [[[SP800-131A,NIST SP 800-131A]]]. We need to constrain the occurrences of <u> more strictly, and to ensure that there are no smart quotes or dashes in IETF Metanorma XML, including in Relaton content.

opoudjis commented 3 years ago

The NIST Relaton entry contains a smart em-dash flanked by thin spaces, &#8201;&#8212;&#8201;. We will need to unsmarten that... except that we wouldn't want to unsmarten other Unicode content there, like diacritics...

opoudjis commented 3 years ago

We have constrained <u> explicitly to children of t, blockquote, li, dd, preamble, td, th, and annotation.

We are currently unsmartening only single and double quotes in IETF Metanorma XML. We need to extend this to spaces and dashes.

IETF expects dashes to be dumb, as in this title: "Information Processing Systems - Local Area Networks - Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications, 2nd edition", from https://xml2rfc.tools.ietf.org/public/rfc/bibxml-ieee/reference.IEEE.802-3.1990.xml

petithug commented 3 years ago

You may want to add .gsub(/\u2026/, "...") .gsub(/\u200b/, ""),

to the list.

opoudjis commented 3 years ago

Good call. Adding.
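
For illustration only, the unsmartening discussed in this thread (quotes, dashes, spaces, plus the ellipsis and zero-width space suggested above) could be sketched as a single gsub chain. The helper name `unsmarten` and the exact set of code points are assumptions, not the actual metanorma-ietf code; diacritics and other Unicode content are deliberately left alone.

```ruby
# Hypothetical helper: the name and the character list are assumptions,
# not the metanorma-ietf implementation.
def unsmarten(text)
  text
    .gsub(/[\u2018\u2019]/, "'")   # curly single quotes
    .gsub(/[\u201c\u201d]/, '"')   # curly double quotes
    .gsub(/[\u2013\u2014]/, "-")   # en dash and em dash
    .gsub(/[\u2009\u00a0]/, " ")   # thin space and no-break space
    .gsub(/\u2026/, "...")         # horizontal ellipsis
    .gsub(/\u200b/, "")            # zero-width space
end

# A thin-space/em-dash/thin-space run collapses to " - ";
# the diacritic in "naïve" is preserved.
puts unsmarten("Part\u2009\u2014\u2009One \u201cna\u00efve\u201d\u2026")
```

Mapping each thin space to a plain space and the em-dash to a hyphen gives the " - " form seen in the IEEE bibxml title quoted earlier.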

petithug commented 3 years ago

I also had some issues with "<" since 2.4.2. I fixed that with the following change:

-        n.replace(n.text.gsub(/[\u0080-\uffff]/, "<u>\\0</u>"))
+        n.text.gsub!(/[\u0080-\uffff]/, "<u>\\0</u>")
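
As a rough, string-only sketch of the idea behind this thread (not the change in the diff above, and not the project's code): one way to avoid trouble with a literal "<" is to escape markup characters before wrapping non-ASCII characters in <u>, and to wrap only when the parent element is in the allowed list mentioned earlier. The names `ALLOWED_PARENTS`, `escape_markup`, and `wrap_non_ascii` are hypothetical.

```ruby
# Parent elements under which <u> is permitted, per the list given earlier
# in this thread. The constant name is hypothetical.
ALLOWED_PARENTS = %w[t blockquote li dd preamble td th annotation].freeze

# Escape the characters that would otherwise be read as markup.
def escape_markup(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Hypothetical helper: returns an XML fragment for a text node whose parent
# element is named parent_name. Non-ASCII characters are wrapped in <u> only
# under an allowed parent; the text is escaped in either case.
def wrap_non_ascii(text, parent_name)
  escaped = escape_markup(text)
  return escaped unless ALLOWED_PARENTS.include?(parent_name)

  escaped.gsub(/[\u0080-\uffff]/, "<u>\\0</u>")
end

puts wrap_non_ascii("a < b, na\u00efve", "t")
puts wrap_non_ascii("a < b, na\u00efve", "name")
```

Escaping first means the only unescaped angle brackets in the result are the <u> tags the helper adds itself.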