w3c / DOM-Parsing

DOM Parsing and Serialization
https://w3c.github.io/DOM-Parsing/

DocumentType XML serialization doesn't handle the presence of double quotes in system ID #71

Open cscott opened 3 years ago

cscott commented 3 years ago

In https://w3c.github.io/DOM-Parsing/#dfn-xml-serializing-a-documenttype-node we read:

  2. If the require well-formed flag is true and the node's systemId attribute contains characters that are not matched by the XML Char production or that contains both a """ (U+0022 QUOTATION MARK) and a "'" (U+0027 APOSTROPHE), then throw an exception; the serialization of this node would not be a well-formed document type declaration. ...
  9. If the node's systemId is not the empty string then append the following, in the order listed, to markup: 9.1 " " (U+0020 SPACE); 9.2 """ (U+0022 QUOTATION MARK); 9.3 The value of the node's systemId attribute; 9.4 """ (U+0022 QUOTATION MARK).

The intention here seems to be to surround the systemId with single quotes if it contains a double quote and with double quotes otherwise, throwing an exception only if the systemId contains both a single quote and a double quote. But that good idea got lost between step 2 and step 9: step 9 always surrounds the systemId with double quotes, so a systemId that contains a double quote passes the check in step 2 and is then serialized as a malformed system literal.
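To make the gap concrete, here is a minimal sketch of steps 2 and 9 as quoted above, transcribed literally. The function name is hypothetical, the XML Char check is left out for brevity, and a plain Error stands in for whatever exception the spec intends.

```javascript
// Hypothetical transcription of the quoted steps; not spec text or a browser API.
function serializeSystemIdAsSpecified(systemId, requireWellFormed) {
  // Step 2: throw only when BOTH quote characters are present.
  // (The XML Char production check is omitted in this sketch.)
  if (requireWellFormed && systemId.includes('"') && systemId.includes("'")) {
    throw new Error("systemId would not serialize to a well-formed doctype");
  }
  // Step 9: nothing is appended for an empty systemId.
  if (systemId === "") {
    return "";
  }
  // Steps 9.1 to 9.4: space, double quote, value, double quote.
  return ' "' + systemId + '"';
}

// A systemId with only a double quote passes step 2, yet the literal
// it produces is terminated early by the embedded quote.
const literal = serializeSystemIdAsSpecified('foo"bar', true);
console.log(literal);
```

With the input `foo"bar` no exception is thrown even though the require well-formed flag is set, and the returned string contains three double quotes.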

One of two fixes should be made: A. Tweak step 2 to remove the mention of U+0027 APOSTROPHE and throw the exception whenever the systemId contains U+0022 QUOTATION MARK; or B. Change steps 9.2 and 9.4 to both say "U+0022 QUOTATION MARK if the node's systemId does not contain a U+0022 QUOTATION MARK, otherwise U+0027 APOSTROPHE".
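Option B could look roughly like the sketch below. The helper name `quoteSystemId` is made up for illustration, the XML Char check is again omitted, and a plain Error stands in for the exception; this is one reading of the proposed wording, not spec text.

```javascript
// Hypothetical helper illustrating option B; not part of the spec or any browser API.
function quoteSystemId(systemId, requireWellFormed) {
  const hasQuotationMark = systemId.includes('"');
  const hasApostrophe = systemId.includes("'");
  // Step 2, unchanged: a value containing both quote characters cannot be
  // written as a system literal. (XML Char check omitted in this sketch.)
  if (requireWellFormed && hasQuotationMark && hasApostrophe) {
    throw new Error("systemId contains both U+0022 and U+0027");
  }
  if (systemId === "") {
    return "";
  }
  // Steps 9.2 and 9.4 under option B: use U+0022 unless the value
  // itself contains U+0022, in which case use U+0027.
  const delimiter = hasQuotationMark ? "'" : '"';
  return " " + delimiter + systemId + delimiter;
}

console.log(quoteSystemId('foo"bar', true));
console.log(quoteSystemId("foo.dtd", true));
```

Under option A the helper would instead throw as soon as `hasQuotationMark` is true and always use U+0022 as the delimiter, which is simpler but rejects documents that option B can serialize.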

Option B is what Firefox appears to do:

$doc = (new DOMParser()).parseFromString("<!DOCTYPE root SYSTEM 'foo\"bar'><root><child>text</child></root>", "text/xml");
(new XMLSerializer()).serializeToString($doc)

outputs

<!DOCTYPE root SYSTEM 'foo"bar'>
<root><child>text</child></root>