metanorma / metanorma-ietf

Metanorma processor for IETF documents
BSD 2-Clause "Simplified" License

Malformed Output XML for rfc caused by presence of `&amp;` entity #211

Closed manuelfuenmayor closed 3 months ago

manuelfuenmayor commented 3 months ago

In relation to https://github.com/metanorma/rfc-asciidoc-rfc/issues/26

Description

The presence of an &amp; entity in an XML reference causes a compilation failure.

XML reference sample:

<reference anchor='Asciidoctor-Manual' 
  target='http://asciidoctor.org/docs/user-manual/'>
  <front>
    <title>Asciidoctor: A fast text processor &amp; publishing 
      toolchain for converting AsciiDoc to HTML5, 
      DocBook &amp; more.</title>
    <author initials='D.' surname='Allen' fullname='Dan Allen'>
      <organization />
    </author>
    <author initials='R.' surname='Waldron' fullname='Ryan Waldron'>
      <organization />
    </author>
    <author initials='S.' surname='White' fullname='Sarah White'>
      <organization />
    </author>

(See <title> tag)

To replicate

Use the test-amp-bug branch of the rfc-asciidoc-rfc repository (https://github.com/metanorma/rfc-asciidoc-rfc/tree/test-amp-bug) and try to compile the draft-ribose-asciirfc.adoc file. Compilation succeeds only after the &amp; entities are removed from both ./references/informative/asciidoctor-manual.xml and ./references/informative/asciidoctor.xml.

opoudjis commented 3 months ago

Yeah, this is going to need double escaping.

Passthrough, which is being invoked here, converts < to &lt; and & to &amp;, as it should:

<title>Asciidoctor: A fast text processor &amp; publishing
      toolchain for converting AsciiDoc to HTML5,
      DocBook &amp; more.</title>

ends up as

 &lt;title&gt;Asciidoctor: A fast text processor &amp; publishing
      toolchain for converting AsciiDoc to HTML5,
      DocBook &amp; more.&lt;/title&gt;

Which will convert back to:

<title>Asciidoctor: A fast text processor & publishing
      toolchain for converting AsciiDoc to HTML5,
      DocBook & more.</title>

And that's the problem: the passthrough assumes that &amp; should be left as &amp;, and it shouldn't. It should end up as &amp;amp;, so that it converts back to &amp; rather than to a bare &.
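The difference between single and double escaping can be sketched as follows. This is only an illustration that uses Python's standard html module as a stand-in; it is not Metanorma's actual (Ruby) code, and the helper names escape_leaving_entities and escape_everything are hypothetical.

```python
from html import escape, unescape

source = "<title>DocBook &amp; more.</title>"


def escape_leaving_entities(text):
    """Hypothetical helper: escape markup but leave an existing &amp; untouched."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_everything(text):
    """Hypothetical helper: escape every &, < and >, so &amp; becomes &amp;amp;."""
    return escape(text, quote=False)


single = escape_leaving_entities(source)
double = escape_everything(source)

# Converting back once loses the entity in the first case and keeps it in the second.
print(unescape(single))  # <title>DocBook & more.</title>
print(unescape(double))  # <title>DocBook &amp; more.</title>
```

Only the double-escaped form converts back to the original, well-formed XML.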

opoudjis commented 3 months ago

Pass blocks are being processed by decoding and then re-encoding XML escapes, to ensure that any escapes in passthroughs are resolved. That is resulting in & &amp; => & & => &amp; &amp;, so a bare & and an &amp; entity become indistinguishable.

The only way we are going to get a coherent outcome here is if we do not decode the XML: if we find &amp; in a passthrough, we go and encode it as &amp;amp;, and if we see &#x200c;, we do not try to resolve it to the character it stands for. Decoding was a naive decision, and Asciidoctor doesn't try that kind of thing; it passes through & &amp; to HTML as & &amp;. Metanorma shouldn't decode either.
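A minimal sketch of the two strategies, again using Python's standard html module purely for illustration (Metanorma itself is written in Ruby, and the function names decode_then_reencode and encode_only are hypothetical):

```python
from html import escape, unescape

sample = "& &amp; &#x200c;"


def decode_then_reencode(text):
    """The approach described as naive: resolve escapes, then escape again."""
    return escape(unescape(text), quote=False)


def encode_only(text):
    """The proposed approach: never decode, just escape what is there."""
    return escape(text, quote=False)


lossy = decode_then_reencode(sample)
faithful = encode_only(sample)

# Decoding first collapses "&" and "&amp;" into the same output and
# resolves the character reference into a literal U+200C character.
print(repr(lossy))

# Encoding only is reversible: one decoding pass restores the input exactly.
print(faithful)
print(unescape(faithful) == sample)
```

With the encode-only strategy the passthrough content survives a round trip unchanged, which matches the behaviour attributed to Asciidoctor above.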