jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Jsoup produces invalid XML #1937

Closed mvera closed 1 year ago

mvera commented 1 year ago

jsoup does not correctly handle an ampersand inside a script when converting to XML, and it does not correctly handle the &nbsp; entity either. In both cases the generated XML is invalid. Below is a JUnit test that reproduces the bugs. The test is standalone and requires only Java.

Java 17
jsoup 1.15.4

Tests 1 and 3 fail. Test 2 passes.

package jsoup.test;

import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.jsoup.Jsoup;
import org.junit.Test;
import org.xml.sax.SAXException;

public class JsoupTest {
    @Test
    public void testHtml2XmlWithJsoupWithAmpersand() {
        String htmlDoc = "<!DOCTYPE html><html><head><script>&</script></head><body></body></html>";
        html2xml(htmlDoc);
    }

    @Test
    public void testHtml2XmlWithJsoupWithAmpersandEntity() {
        String htmlDoc = "<!DOCTYPE html><html><head><script>anything &amp;</script></head><body></body></html>";
        html2xml(htmlDoc);
    }

    @Test
    public void testHtml2XmlWithJsoupWithNbspEntity() {
        String htmlDoc = "<!DOCTYPE html><html><head></head><body>&nbsp;</body></html>";
        html2xml(htmlDoc);
    }

    private void html2xml(String htmlDoc) {
        try {
            // generate xml with jsoup
            org.jsoup.nodes.Document document = Jsoup.parse(htmlDoc);
            document.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
            document.outputSettings().charset("UTF-8");
            String xml = document.html();

            // parse xml with standard Java libraries
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            ByteArrayInputStream input = new ByteArrayInputStream(xml.getBytes("UTF-8"));
            builder.parse(input);
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new RuntimeException(e);
        }
    }

}

jhy commented 1 year ago

The problem is that you are still parsing the input as HTML, so jsoup treats tags such as script according to HTML rules. In particular, entities inside script are not escaped. Changing the output syntax to xml while still parsing the input as HTML is intended to produce XHTML output.
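To make the failure mode concrete, here is a minimal sketch using only the JDK's XML parser (no jsoup involved). It checks a few hand-written fragments shaped like the markup discussed in this issue. The class and method names (WellFormedCheck, isWellFormed) are hypothetical, and the fragments are illustrative inputs, not actual jsoup output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of jsoup: asks the JDK parser whether a string is well-formed XML.
class WellFormedCheck {

    /** Returns true if the JDK's DOM parser accepts the string, false if it reports a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints fatal errors to stderr.
            // DefaultHandler still rethrows fatal errors as SAXParseException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A bare ampersand is not allowed in XML character data.
        System.out.println(isWellFormed("<script>&</script>"));
        // &amp; is one of the five entities predefined by XML.
        System.out.println(isWellFormed("<script>anything &amp;</script>"));
        // &nbsp; is an HTML entity; XML does not know it unless a DTD declares it.
        System.out.println(isWellFormed("<body>&nbsp;</body>"));
        // A numeric character reference for the same character needs no declaration.
        System.out.println(isWellFormed("<body>&#xa0;</body>"));
    }
}
```

The first and third fragments are rejected for the same reasons tests 1 and 3 in the report fail: an unescaped & and an undeclared named entity.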

If you parse with the jsoup XML parser (org.jsoup.parser.Parser.xmlParser()) instead, your tests will pass.

org.jsoup.nodes.Document document = Jsoup.parse(htmlDoc, Parser.xmlParser());

As in #1942, it may be simpler to use the W3CDom here.
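As a rough illustration of why going through a W3C DOM can help, the sketch below uses only JDK classes and does not call jsoup's W3CDom at all. It assumes you already have an org.w3c.dom.Document (built by hand here) and serializes it with the XML output method. Text nodes are escaped at serialization time, so a raw & in a script becomes &amp;. The names DomSerializeSketch and serializeScript are hypothetical.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: builds a tiny W3C DOM by hand and serializes it as XML.
class DomSerializeSketch {

    /** Wraps the given text in html/script elements and returns the serialized XML. */
    static String serializeScript(String scriptText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element script = doc.createElement("script");
            // The DOM stores the raw characters; no escaping happens at this point.
            script.setTextContent(scriptText);
            html.appendChild(script);
            doc.appendChild(html);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Force XML output. Without this, a root element named "html" can make the
            // transformer default to the HTML output method, which leaves script text unescaped.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The serialized form contains &amp; in place of the raw ampersand.
        System.out.println(serializeScript("&"));
    }
}
```

Whether W3CDom's own string output uses the same serializer settings is not shown here; check the jsoup documentation before relying on it.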