htmlparser's HtmlDocumentBuilder should support the XPath API

validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

htmlparser's HtmlDocumentBuilder should support the XPath API #11

Closed anthonyvdotbe closed 4 years ago

anthonyvdotbe commented 4 years ago

To reproduce, create an App.java file with the contents as below, and run it with htmlparser's jar available on the classpath. Note that the XPath expression only works when parsing the source with the JDK's DocumentBuilder. When using htmlparser's HtmlDocumentBuilder, it always returns an empty result.

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class App {

    private static final String SOURCE = 
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>" +
            "<html>" +
                "<body>" +
                    "<h3>foo</h3>" +
                    "<h3>bar</h3>" +
                    "<h3>baz</h3>" +
                "</body>" +
            "</html>";

    private static final String QUERY = "//h3";

    public static void main(String... args) throws Exception {
        query(DocumentBuilderFactory.newInstance().newDocumentBuilder());
        query(new HtmlDocumentBuilder(XmlViolationPolicy.FATAL));
        query(new HtmlDocumentBuilder(XmlViolationPolicy.ALLOW));
        query(new HtmlDocumentBuilder(XmlViolationPolicy.ALTER_INFOSET));
    }

    private static void query(DocumentBuilder builder) throws Exception {
        Document document = builder.parse(new InputSource(new StringReader(SOURCE)));
        XPathExpression query = XPathFactory.newInstance().newXPath().compile(QUERY);
        var numResults = ((NodeList) query.evaluate(document, XPathConstants.NODESET)).getLength();
        System.out.println(numResults);
    }

}

sideshowbarker commented 4 years ago

Which exact jar are you using? Where did the jar come from?

anthonyvdotbe commented 4 years ago

Sorry about that. I'm using htmlparser-1.4.15 via Maven with:

        <dependency>
            <groupId>nu.validator</groupId>
            <artifactId>htmlparser</artifactId>
            <version>1.4.15</version>
        </dependency>

sideshowbarker commented 4 years ago

The https://repo1.maven.org/maven2/nu/validator/htmlparser/1.4.15/htmlparser-1.4.15.jar distribution isn’t actually intended to be used outside the context of the HTML checker. It’s built from a separate branch, the validator-nu branch.

The https://repo1.maven.org/maven2/nu/validator/htmlparser/htmlparser/1.4/htmlparser-1.4.jar distribution is the most-recent release built from the master branch. Can you please try re-testing with that?

If the same problem can be reproduced with that htmlparser-1.4.jar distribution, then we know we have the same issue on master. If, on the other hand, it can’t be reproduced with that jar, then we know the cause is a change that was introduced on the validator-nu branch.

anthonyvdotbe commented 4 years ago

Thanks for pointing that out. Yes, the issue reproduces with the htmlparser-1.4.jar distribution as well. I've also tried building from the current master branch myself (with mvn clean verify), and the same issue arises.

sideshowbarker commented 4 years ago

I haven’t tested your code yet, but my guess is that this is a namespace issue. This parser is an HTML parser, not a general XML parser, so it puts elements into the http://www.w3.org/1999/xhtml namespace. So I think an unprefixed XPath query such as //h3 isn’t going to return any nodes unless you either check for h3 elements in the http://www.w3.org/1999/xhtml namespace, or else check for the local name.
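The effect described here can be seen without htmlparser at all. The following is a minimal sketch, assuming only the JDK: it parses a small XHTML string with a namespace-aware DocumentBuilder (the class name NamespaceDemo and its count helper are made up for illustration) and shows that an unprefixed name test selects nothing once the elements are in a namespace, while a local-name() test still matches.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceDemo {

    // Well-formed XML whose elements are all in the XHTML namespace.
    private static final String SOURCE =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
                "<body><h3>foo</h3><h3>bar</h3><h3>baz</h3></body>" +
            "</html>";

    // Parses SOURCE with a namespace-aware JDK parser and counts the nodes the query selects.
    public static int count(String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SOURCE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrapped only to keep the demo helper free of checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String... args) {
        System.out.println(count("//h3"));                     // prints 0
        System.out.println(count("//*[local-name() = 'h3']")); // prints 3
    }
}
```

In XPath 1.0 an unprefixed name such as h3 only matches elements that are in no namespace, which is why the first query comes back empty here.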

sideshowbarker commented 4 years ago

So yeah I think you want to try this:

private static final String QUERY = "//*[local-name() = 'h3']";

As for alternatively using a namespace-aware XPath expression: I think the HTML parser doesn’t itself assign any namespace prefix to the http://www.w3.org/1999/xhtml namespace, so I guess the only way to do a prefixed query on output from it would be for your application code to bind a prefix to that namespace (if there’s even a way to actually do that).
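A query can also be made namespace-aware without binding any prefix, by testing namespace-uri() next to local-name() in the predicate. The sketch below assumes only the JDK and a namespace-aware DocumentBuilder; the class name, the helper, and the urn:example:other namespace are hypothetical and exist only to show that the stricter predicate filters out a same-named element from another namespace.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceUriPredicateDemo {

    // One h3 in the XHTML namespace, one h3 in an unrelated (made-up) namespace.
    private static final String SOURCE =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
                "<body>" +
                    "<h3>foo</h3>" +
                    "<x:h3 xmlns:x=\"urn:example:other\">bar</x:h3>" +
                "</body>" +
            "</html>";

    // Matches any element whose local name is h3, whatever its namespace.
    public static final String BY_LOCAL_NAME = "//*[local-name() = 'h3']";

    // Matches only h3 elements that are in the XHTML namespace.
    public static final String BY_NAME_AND_NAMESPACE =
            "//*[local-name() = 'h3' and namespace-uri() = 'http://www.w3.org/1999/xhtml']";

    // Parses SOURCE with a namespace-aware JDK parser and counts the nodes the query selects.
    public static int count(String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SOURCE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrapped only to keep the demo helper free of checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String... args) {
        System.out.println(count(BY_LOCAL_NAME));         // prints 2
        System.out.println(count(BY_NAME_AND_NAMESPACE)); // prints 1
    }
}
```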

sideshowbarker commented 4 years ago

Also be aware that the tree you get from parsing the document in your code with the HTML parser is going to be different from the tree you get with an XML parser; the HTML parser will give this:

<!--?xml version="1.0" encoding="UTF-8" standalone="yes" ?-->
<html><head></head><body>
<h3>foo</h3>
<h3>bar</h3>
<h3>baz</h3>

</body></html>

So, the HTML parser parses the XML declaration as a comment, and adds an (empty) head element. But with a more-complex document, there are lots of other places where you could end up with a tree that’s very different from what an XML parser will give you.
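To compare the trees that different parsers produce, it can help to serialize whatever Document comes back and read the result. The following is a rough sketch using the JDK's identity Transformer; TreeDump, toXml, and parseWithJdk are hypothetical names, and the demo only runs the JDK parser so that it stays self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TreeDump {

    // Serializes a DOM node (for example a whole Document) to a string.
    public static String toXml(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Force XML output so the serializer does not switch to HTML rules for an html root.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses well-formed XML with a namespace-aware JDK DocumentBuilder.
    public static Document parseWithJdk(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String... args) {
        Document document = parseWithJdk("<html><body><h3>foo</h3></body></html>");
        System.out.println(toXml(document));
    }
}
```

Passing the Document returned by HtmlDocumentBuilder.parse to the same toXml helper should make differences such as the added head element visible side by side.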

anthonyvdotbe commented 4 years ago

Thanks a lot for your help; this is indeed a non-issue. For future reference, the following code makes it possible to use //h:h3 as the XPath query:

import java.io.StringReader;
import java.util.Iterator;

import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class App {

    private static final String SOURCE = 
            "<!DOCTYPE html>" +
            "<html>" +
                "<body>" +
                    "<h3>foo</h3>" +
                    "<h3>bar</h3>" +
                    "<h3>baz</h3>" +
                "</body>" +
            "</html>";

    private static final String QUERY = "//h:h3";

    public static void main(String... args) throws Exception {
        query(new HtmlDocumentBuilder(XmlViolationPolicy.ALLOW));
    }

    private static void query(DocumentBuilder builder) throws Exception {
        Document document = builder.parse(new InputSource(new StringReader(SOURCE)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(new NamespaceContext(){

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                throw new UnsupportedOperationException();
            }

            @Override
            public String getPrefix(String namespaceURI) {
                throw new UnsupportedOperationException();
            }

            @Override
            public String getNamespaceURI(String prefix) {
                if(prefix.equals("h")) {
                    return "http://www.w3.org/1999/xhtml";
                } else {
                    throw new UnsupportedOperationException();
                }
            }

        });
        XPathExpression query = xPath.compile(QUERY);
        var numResults = ((NodeList) query.evaluate(document, XPathConstants.NODESET)).getLength();
        System.out.println(numResults);
    }

}
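The anonymous NamespaceContext above throws UnsupportedOperationException for everything except the h prefix, which is enough for this query. As a possible refinement, the sketch below follows the contract described in the NamespaceContext Javadoc more closely (IllegalArgumentException for null arguments, XMLConstants.NULL_NS_URI for unbound prefixes). The class name XhtmlNamespaceContext and the count helper are hypothetical, and the demo uses a namespace-aware JDK parser on XHTML input so that it runs without htmlparser on the classpath.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public final class XhtmlNamespaceContext implements NamespaceContext {

    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final String prefix;

    // Binds the given prefix (for example "h") to the XHTML namespace.
    public XhtmlNamespaceContext(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public String getNamespaceURI(String p) {
        if (p == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        if (p.equals(prefix)) {
            return XHTML_NS;
        }
        if (XMLConstants.XML_NS_PREFIX.equals(p)) {
            return XMLConstants.XML_NS_URI;
        }
        if (XMLConstants.XMLNS_ATTRIBUTE.equals(p)) {
            return XMLConstants.XMLNS_ATTRIBUTE_NS_URI;
        }
        return XMLConstants.NULL_NS_URI; // unbound prefix
    }

    @Override
    public String getPrefix(String namespaceURI) {
        if (namespaceURI == null) {
            throw new IllegalArgumentException("namespaceURI must not be null");
        }
        return XHTML_NS.equals(namespaceURI) ? prefix : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String p = getPrefix(namespaceURI);
        return p == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(p).iterator();
    }

    // Demo helper: counts matches of a query that uses the "h" prefix, on a small XHTML document.
    public static int count(String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder().parse(new InputSource(new StringReader(
                    "<html xmlns=\"" + XHTML_NS + "\"><body>"
                            + "<h3>foo</h3><h3>bar</h3><h3>baz</h3></body></html>")));
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(new XhtmlNamespaceContext("h"));
            return ((NodeList) xPath.evaluate(query, document, XPathConstants.NODESET)).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String... args) {
        System.out.println(count("//h:h3")); // prints 3
    }
}
```

The same context object could be passed to xPath.setNamespaceContext in the App class above in place of the anonymous class.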