ietf-tools / xml2rfc

Generate RFCs and IETF drafts from document source in XML according to the IETF xml2rfc v2 and v3 vocabularies
https://ietf-tools.github.io/xml2rfc/
BSD 3-Clause "New" or "Revised" License

Zero-width spaces in updates/obsoletes #876

Open martinthomson opened 2 years ago

martinthomson commented 2 years ago

Describe the issue

RFC 8996 contains a LOT of RFC numbers in its updates attribute. Along with some of those, it includes a Unicode zero-width space character (U+200b).

While it is not clear that whitespace is allowed in this attribute, xml2rfc has been tolerant of whitespace thus far. The wrinkle here is that Python's str.strip() does not, by default, recognize "\u200b" as whitespace. As a result, the character has made its way into the links in the HTML that xml2rfc generates. The resulting links are bad. (The HTML is also bad, but that is of less immediate consequence.)
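To illustrate the strip() behaviour described above, here is a minimal sketch. The attribute value and the link format are made up for illustration; this is not the actual RFC 8996 attribute, nor the code xml2rfc uses to build links.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: a format character, not whitespace to Python

# Hypothetical attribute value with a zero-width space before one number.
updates = "3261, \u200b3329, 3436"

# Split on commas and strip each piece, as a whitespace-tolerant parser might.
tokens = [t.strip() for t in updates.split(",")]

# Ordinary spaces are gone, but the zero-width space survives strip().
has_zwsp = [ZWSP in t for t in tokens]

# Hypothetical link target built from the token: the invisible character
# ends up inside it.
href = "rfc" + tokens[1]

print(ZWSP.isspace())   # False
print(has_zwsp)         # [False, True, False]
print(len(href))        # 8, not 7: "rfc" + ZWSP + "3329"
```

Passing the character explicitly, as in `t.strip(" \t\u200b")`, would remove it, which is roughly what the point fix in option 1 would amount to.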

Options here appear to be:

  1. Be more tolerant of garbage. While there is no reason to use U+200b in this attribute, that happened. xml2rfc can strip it out. The problem with this is that a point fix here only creates inconsistency in the interface. Why is U+200b allowed here but not elsewhere?
  2. Don't change xml2rfc, but instead fix the problem at the source. This is probably a one-off. Though it messes with the RFC immutability thing, that's not sacrosanct.

My opinion is that the second course is better, but that requires broader discussion, probably in RSWG. Either way, I wanted to open this to track the issue.


JayDaley commented 1 year ago

Personal opinion:

I think it is clear enough in RFC 7991 that the updates attribute can only contain numbers, commas and draft names, not whitespace or anything else.

My preferred way forward would be to have xml2rfc strip everything that is not an RFC number, comma or I-D name. This would leave the formatting of that in the rendered RFCs to xml2rfc, which is more appropriate given that a) this is header/metadata, not content, and so the formatting is for the series editors to manage, not the authors; and b) the tool can do a better job of formatting across all the different renderings.
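A rough sketch of what that stripping could look like, for discussion only. The helper name `normalize_rfc_list` and the token pattern are assumptions made for this example; they are not existing xml2rfc code, and the I-D name pattern is a guess rather than the exact grammar.

```python
import re

# Assumption: a token is either an RFC number (ASCII digits) or an I-D name
# (lowercase letters, digits and hyphens, starting with "draft-").
_TOKEN = re.compile(r"[0-9]+|draft-[a-z0-9]+(?:-[a-z0-9]+)*")

def normalize_rfc_list(value):
    """Hypothetical helper: keep only RFC numbers and I-D names."""
    cleaned = []
    for part in value.split(","):
        # Drop every character that cannot appear in a token; this removes
        # spaces, tabs and U+200B alike.
        part = re.sub(r"[^A-Za-z0-9-]", "", part)
        # Keep the piece only if what remains is a well-formed token.
        if _TOKEN.fullmatch(part):
            cleaned.append(part)
    return cleaned

print(normalize_rfc_list("3261, \u200b3329,, draft-ietf-foo-bar-00"))
# ['3261', '3329', 'draft-ietf-foo-bar-00']
```

Returning a list rather than a re-joined string would leave the separators and line breaking in each rendering up to the tool.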

It is unfortunate that we now have an RFC with "\u200b" in that attribute as it is likely to break or at least confuse any code that parses it. If we are going to republish the XML then "unexpected stuff in the XML that is contrary to our documentation and likely to break parsers" is probably high up the list of reasons to do so.