Invalid XML characters in output with Syntax.xml

Background

I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]. XML 1.0 on the other hand, does not permit ASCII control characters like . An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g. "\u000c") or as an escaped sequence (e.g. ""). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence .

If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when OutputSettings.syntax is Syntax.xml. Another option would be to discard those characters when Syntax.xml is used. To be honest it's not clear if Syntax.xml refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have been Syntax.xhtml and not Syntax.xml. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.

I can help prepare a PR if someone can decide what the expected behaviour should be.

Desired behaviour


@Test
public void invalidCharactersDiscardedInXml() {
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).toString();
    Assert.assertFalse(cleaned.contains("\f"));
    Assert.assertFalse(cleaned.contains("&#xc;"));
    Assert.assertTrue(cleaned.matches("AAA\\ *BBB\\ *CCC"));
}

A subtle detail in the test above is that if prettyPrint is true  will be replaced with a space but other control characters forbidden in XML documents (e.g. ) will not. The underlying problem is not really solved if pretty printing is enabled.

Observed behaviour


@Test
public void invalidCharactersNotDiscardedInXml() {
    String invalid = "AAA&#xc;BBB\fCCC";
    Document doc = Jsoup.parseBodyFragment(invalid);
    OutputSettings settings = new OutputSettings().escapeMode(EscapeMode.xhtml).syntax(Syntax.xml).prettyPrint(false);
    String cleaned = doc.outputSettings(settings).body().html();
    Assert.assertEquals(cleaned, "AAA&#xc;BBB&#xc;CCC");
}

jhy / jsoup