XML for W3C specs is broken (anchor is incorrect)

reschke commented 2 years ago

Describe the issue

curl https://bib.ietf.org/public/rfc/bibxml4/reference.W3C.REC-xml-stylesheet-20101028.xml
<reference anchor="W3C_REC_xml_stylesheet_20101028" target="https://www.w3.org/TR/2010/REC-xml-stylesheet-20101028/">
  <front>
    <title>Associating Style Sheets with XML documents 1.0 (Second Edition)</title>
    <author fullname="Henry Thompson" role="editor"/>
    <author fullname="James Clark" role="editor"/>
    <author fullname="Simon Pieters" role="editor"/>
    <date day="28" month="October" year="2010"/>
  </front>
  <seriesInfo name="W3C REC" value="REC-xml-stylesheet-20101028"/>
  <seriesInfo name="W3C" value="REC-xml-stylesheet-20101028"/>
</reference>

Anchor needs to be:

W3C.REC-xml-stylesheet-20101028

(You really can't change the anchors; they are part of the contract for including the references)
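
The contract can be checked mechanically. Below is a minimal sketch, assuming the expected anchor is the file name with the `reference.` prefix and the `.xml` suffix removed, which matches the example in this issue. `expected_anchor` and `anchor_matches` are hypothetical helper names, not part of bibxml-service.

```python
import xml.etree.ElementTree as ET


def expected_anchor(filename: str) -> str:
    """Derive the anchor a citing document expects from the bibxml file name.

    Assumption: the anchor is the file name without the 'reference.' prefix
    and the '.xml' suffix, as in the example from this issue.
    """
    name = filename.rsplit("/", 1)[-1]
    if name.startswith("reference."):
        name = name[len("reference."):]
    if name.endswith(".xml"):
        name = name[: -len(".xml")]
    return name


def anchor_matches(filename: str, xml_text: str) -> bool:
    """Return True if the <reference> anchor matches the file name."""
    root = ET.fromstring(xml_text)
    return root.get("anchor") == expected_anchor(filename)
```

Run against the response quoted above, `anchor_matches` would return False, because the anchor uses underscores where the file name uses a dot and hyphens.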

Code of Conduct

[X] I agree to follow the IETF's Code of Conduct

reschke commented 2 years ago

Note that this is a serious regression which breaks existing documents...

ronaldtse commented 2 years ago

@reschke thanks for raising this issue. This is indeed a bug; we will deal with it.

reschke commented 2 years ago

As this is a breaking change - would it be possible to roll back to a version without that bug until it gets fixed? (there's a real risk that people will start "adjusting" their documents, and then we'll be in a very undesirable state).

strogonoff commented 2 years ago

There’s a PR pending for this now.

kesara commented 2 years ago

Fix deployed to https://bib.ietf.org

curl https://bib.ietf.org/public/rfc/bibxml4/reference.W3C.REC-xml-stylesheet-20101028.xml
<reference anchor="W3C.REC-xml-stylesheet-20101028" target="https://www.w3.org/TR/2010/REC-xml-stylesheet-20101028/">
  <front>
    <title>Associating Style Sheets with XML documents 1.0 (Second Edition)</title>
    <author fullname="Henry Thompson" role="editor"/>
    <author fullname="James Clark" role="editor"/>
    <author fullname="Simon Pieters" role="editor"/>
    <date day="28" month="October" year="2010"/>
  </front>
  <seriesInfo name="W3C REC" value="REC-xml-stylesheet-20101028"/>
  <seriesInfo name="W3C" value="REC-xml-stylesheet-20101028"/>
</reference>

reschke commented 2 years ago

But then:

  <seriesInfo name="W3C REC" value="REC-xml-stylesheet-20101028"/>
  <seriesInfo name="W3C" value="REC-xml-stylesheet-20101028"/>

That doesn't look right.

strogonoff commented 2 years ago

Hi @reschke,

I see this in the new RFCXML spec about seriesInfo:

Specifies the document series in which this document appears, and also specifies an identifier within that series.

A processing tool determines whether it is working on an RFC or an Internet-Draft by inspecting the name attribute of a <seriesInfo> element inside the <front> element inside the <rfc> element, looking for "RFC" or "Internet-Draft". (Specifying neither value in any of the <seriesInfo> elements can be useful for producing other types of documents but is out of scope for this specification.)

The <seriesInfo> elements as output currently don’t seem to violate the spec, but based on your comment I assume they may violate the expectations of some consumers. Do you mean that consumers will break if they encounter anything other than this?

As an aside, a little more detail about the consumers of the xml2rfc tools path API would be appreciated. From the original requirements it’s not fully clear how exactly they use xml2rfc path output and what their relevant failure modes are.

(I blame myself for not clarifying these uses in detail ahead of implementation, but even now having more details could save some back and forth. Based on prior discussions I was under the impression that XML output is not required to be exactly the same as in the previous implementation of xml2rfc tools, but it seems that aspects in which it is expected to be exactly the same are numerous.)

ronaldtse commented 2 years ago

W3C does not seem to employ a documented document identifier scheme that is intended for seriesInfo. The question I think @reschke is raising is that it “looks different” from the bibxml4 W3C entries, but what it should be is undefined and the implications unknown.

reschke commented 2 years ago

The issue here is that you have two entries where one is sufficient. Any renderer will use both, which will create very confusing output.
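
To make the duplication concrete, here is a minimal sketch that flags `<seriesInfo>` elements sharing the same `value` within one reference. `duplicate_series_values` is a hypothetical helper, and treating identical `value` attributes as redundant is an assumption made for illustration, not a rule taken from the RFCXML spec.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_series_values(xml_text: str) -> list[str]:
    """Return seriesInfo values that occur more than once in a reference."""
    root = ET.fromstring(xml_text)
    counts = Counter(el.get("value") for el in root.iter("seriesInfo"))
    return [value for value, n in counts.items() if n > 1]
```

For the reference quoted in this thread, both entries carry `REC-xml-stylesheet-20101028`, so that value would be reported.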

When in doubt, leave things as they were before.

I also note that the authors do not appear in the order given on the document, which might be a problem with the W3C database.

Ages ago I wrote tools to generate references for W3C specs, see output over here: https://www.greenbytes.de/tech/webdav/w3c-references.html#ref-REC-xml-stylesheet-20101028 - note that the annotation contains what back then the W3C told me was their preference for citing their specs in the IETF.

strogonoff commented 2 years ago

Thanks. I’ll direct my request for information elsewhere.

When in doubt, leave things as they were before.

I’m not in doubt as long as the spec is satisfied, but in this case my lack of doubt apparently broke the consumers regardless.

W3C told me was their preference for citing their specs in the IETF.

@ronaldtse We definitely don’t output <annotation>, unlike the XML sample at the link above. Should we check with W3C?

This seems like another doctype-specific XML rule, so the universal XML serializer in relaton-py/bibxml-service seems like a bad idea. It’d be a mess of conditionals; we should probably implement a small pluggable system and allow extensions to alter the final markup (e.g., depending on which SDO appears to have published the doc).
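
A rough sketch of what such a pluggable step could look like follows. Everything here is hypothetical: the registry, `register`, `run_hooks`, the anchor-prefix check in `is_w3c`, and the example rule that keeps only the first `<seriesInfo>` are illustrations, not the relaton-py or bibxml-service API, and which entry should survive is undecided in this thread.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# Hypothetical registry: each hook receives the serialized <reference>
# element and may modify it in place.
Hook = Callable[[ET.Element], None]
Predicate = Callable[[ET.Element], bool]
_hooks: list[tuple[Predicate, Hook]] = []


def register(applies: Predicate) -> Callable[[Hook], Hook]:
    """Register a hook that runs only when applies(reference) is true."""
    def decorator(hook: Hook) -> Hook:
        _hooks.append((applies, hook))
        return hook
    return decorator


def run_hooks(reference: ET.Element) -> ET.Element:
    """Apply every matching hook, in registration order."""
    for applies, hook in _hooks:
        if applies(reference):
            hook(reference)
    return reference


def is_w3c(reference: ET.Element) -> bool:
    # Assumption: W3C entries can be recognized by their anchor prefix.
    return (reference.get("anchor") or "").startswith("W3C.")


@register(is_w3c)
def keep_single_series_info(reference: ET.Element) -> None:
    """Illustrative W3C-only rule: keep just the first <seriesInfo> child."""
    for extra in reference.findall("seriesInfo")[1:]:
        reference.remove(extra)
```

The point of the design is that the generic serializer stays free of per-SDO conditionals, while each special case lives in its own small, separately testable function.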

reschke commented 2 years ago

OK, at the risk of stating the obvious...

I would recommend running regression tests with an RFCXML document that includes references of all types (DOI, W3C, STD, BCP, etc.), and not updating the system until all changes in the rendered plain text or HTML (your choice, doesn't matter much) are understood and found to be intentional.
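
The comparison step of such a regression test could be sketched as below, assuming the document has already been rendered to plain text before and after the change. The rendering itself is out of scope here, and `rendering_diff` is a hypothetical helper name.

```python
import difflib


def rendering_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two rendered outputs (empty if equal)."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

An empty result means the rendering is unchanged; any other result lists the lines that need to be reviewed and confirmed as intentional before deploying.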

strogonoff commented 2 years ago

@reschke This was not originally implemented because returned XML was expected to differ (to be specific, the service was supposed to return more correct, complete and up-to-date data compared to xml2rfc tools). In some cases entire xml2rfc directories were required to differ, e.g. because they used to start XML with a preamble which is not expected anymore. There are tests that are supposed to fail if the spec (XML schema) is violated, but no tests that fail when returned data is different (although there is a script that diffs returned XML with previous version for human inspection).

However, given a flood of issues where differing data breaks consumers—including breakages caused by strictly speaking more correct, normalized and complete data (such as <organization abbr="IANA">Internet Assigned Numbers Authority</organization> instead of <organization>IANA</organization>)—it appears that the original approach was in fact not the right one. Immediate and long-term solutions are being considered.

ietf-tools / bibxml-service

XML for W3C specs is broken (anchor is incorrect) #286