jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License
10.92k stars 2.18k forks source link

Invalid XML characters in output with Syntax.xml #1743

Open jorditpuig opened 2 years ago

jorditpuig commented 2 years ago

Background

I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]. XML 1.0 on the other hand, does not permit ASCII control characters like . An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g. "\u000c") or as an escaped sequence (e.g. ""). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence .

If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when OutputSettings.syntax is Syntax.xml. Another option would be to discard those characters when Syntax.xml is used. To be honest it's not clear if Syntax.xml refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have been Syntax.xhtml and not Syntax.xml. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.

I can help prepare a PR if someone can decide what the expected behaviour should be.

Desired behaviour


@Test
public void invalidCharactersDiscardedInXml() {
    String invalid = "AAABBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).toString();
    Assert.assertFalse(cleaned.contains("\f"));
    Assert.assertFalse(cleaned.contains(""));
    Assert.assertTrue(cleaned.matches("AAA\\ *BBB\\ *CCC"));
}

A subtle detail in the test above is that if prettyPrint is true  will be replaced with a space but other control characters forbidden in XML documents (e.g. ) will not. The underlying problem is not really solved if pretty printing is enabled.

Observed behaviour


@Test
public void invalidCharactersNotDiscardedInXml() {
    String invalid = "AAABBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).body().html();
    Assert.assertEquals(cleaned, "AAABBBCCC");
}
jhy commented 2 years ago

Thanks for the review. For max compatibility / least surprise within the existing solution, my inclination is that we should align this on XML 1.0 and the rules around it. The EscapeMode xhtml can be the owner of that rule. If we later want to add XML 1.1, we could perhaps make a new escape mode xml1.1 or similar.

The Syntax.xml is not supposed to be tied to HTML / XHTML but just be XML. And not include any of the e.g. HTML ellisions or specific serialization rules.

Potentially we could infer the escape mode to use by the setting of the syntax, or the meta-data in the document, with the ability to override.