Closed anthonyvdotbe closed 4 years ago
Which exact jar are you using? Where did the jar come from?
Sorry about that. I'm using htmlparser-1.4.15 via Maven with:
<dependency>
    <groupId>nu.validator</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.4.15</version>
</dependency>
The https://repo1.maven.org/maven2/nu/validator/htmlparser/1.4.15/htmlparser-1.4.15.jar distribution isn’t actually intended to be used outside the context of the HTML checker. It’s built from a separate branch, the validator-nu branch.
The https://repo1.maven.org/maven2/nu/validator/htmlparser/htmlparser/1.4/htmlparser-1.4.jar distribution is the most-recent release built from the master branch. Can you please try re-testing with that?
If the same problem can be reproduced with that htmlparser-1.4.jar distribution, then we know we have the same issue on master. If, on the other hand, it can’t be reproduced with that jar, then we know the cause is a change that was introduced on the validator-nu branch.
Thanks for pointing that out. Yes, the issue reproduces with the htmlparser-1.4.jar distribution as well. I've also tried building from the current master branch myself (with mvn clean verify), and the same issue arises.
I haven’t tested your code yet, but my guess is that this is a namespace issue. This parser is an HTML parser, not a general XML parser, so it puts documents into the http://www.w3.org/1999/xhtml namespace. So I think an XPath query isn’t going to return any nodes unless you either check for h3 elements in the http://www.w3.org/1999/xhtml namespace or check for the local name.
So yeah I think you want to try this:
private static final String QUERY = "//*[local-name() = 'h3']";
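For anyone who wants to see the namespace effect without the htmlparser jar, here is a minimal sketch using only the JDK. It assumes an XHTML string with an explicit xmlns declaration as a stand-in for the namespaced tree that HtmlDocumentBuilder produces. The class name LocalNameDemo and the count helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LocalNameDemo {
    // Assumption: XHTML with a default namespace stands in for the tree
    // that the HTML parser builds (all elements in the XHTML namespace).
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<h3>foo</h3><h3>bar</h3><h3>baz</h3>"
        + "</body></html>";

    // Hypothetical helper: parses XHTML namespace-aware and returns the
    // number of nodes matched by the given XPath 1.0 expression.
    static int count(String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the JDK default is false
            Document document = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(query, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count("//h3"));                     // prints 0
        // Matching on the local name ignores the namespace.
        System.out.println(count("//*[local-name() = 'h3']")); // prints 3
    }
}
```

The same two queries should behave the same way against a document built by HtmlDocumentBuilder, since the difference comes from the element namespace rather than from the parser.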
As for alternatively using a namespace-aware XPath expression, I think the HTML parser doesn’t itself assign any namespace prefix to the http://www.w3.org/1999/xhtml namespace, so I guess the only way to do a namespace-aware query on its output would be for your application code to assign a prefix to that namespace (if there’s even a way to actually do that).
Also be aware that the tree you get from parsing the document in your code with the HTML parser is going to be different from the tree you get with an XML parser; the HTML parser will give this:
<!--?xml version="1.0" encoding="UTF-8" standalone="yes" ?-->
<html><head></head><body>
<h3>foo</h3>
<h3>bar</h3>
<h3>baz</h3>
</body></html>
So, the HTML parser parses the XML declaration as a comment, and adds an (empty) head element. But with a more-complex document, there are lots of other places where you could end up with a tree that’s very different from what an XML parser will give you.
Thanks a lot for your help; this is indeed a non-issue. For future reference, the following code allows //h:h3 to be used as the XPath query:
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
public class App {
    private static final String SOURCE =
        "<!DOCTYPE html>" +
        "<html>" +
        "<body>" +
        "<h3>foo</h3>" +
        "<h3>bar</h3>" +
        "<h3>baz</h3>" +
        "</body>" +
        "</html>";
    private static final String QUERY = "//h:h3";

    public static void main(String... args) throws Exception {
        query(new HtmlDocumentBuilder(XmlViolationPolicy.ALLOW));
    }

    private static void query(DocumentBuilder builder) throws Exception {
        Document document = builder.parse(new InputSource(new StringReader(SOURCE)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "h" to the XHTML namespace for use in XPath queries.
        xPath.setNamespaceContext(new NamespaceContext() {
            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                throw new UnsupportedOperationException();
            }

            @Override
            public String getPrefix(String namespaceURI) {
                throw new UnsupportedOperationException();
            }

            @Override
            public String getNamespaceURI(String prefix) {
                if (prefix.equals("h")) {
                    return "http://www.w3.org/1999/xhtml";
                } else {
                    throw new UnsupportedOperationException();
                }
            }
        });
        XPathExpression query = xPath.compile(QUERY);
        var numResults = ((NodeList) query.evaluate(document, XPathConstants.NODESET)).getLength();
        System.out.println(numResults);
    }
}
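A possible variant, in case a reusable context is preferred over the anonymous class: the sketch below follows the documented NamespaceContext contract more closely (IllegalArgumentException for null arguments, XMLConstants.NULL_NS_URI for unbound prefixes). The names XhtmlPrefixDemo and SinglePrefixContext are made up for illustration, and it parses XHTML with the JDK's DocumentBuilder as a stand-in so that it runs without the htmlparser jar; I'd expect the same context to work with an HtmlDocumentBuilder document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlPrefixDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper class: binds exactly one prefix to one namespace URI.
    static final class SinglePrefixContext implements NamespaceContext {
        private final String prefix;
        private final String uri;

        SinglePrefixContext(String prefix, String uri) {
            this.prefix = prefix;
            this.uri = uri;
        }

        @Override
        public String getNamespaceURI(String p) {
            if (p == null) throw new IllegalArgumentException("prefix is null");
            if (prefix.equals(p)) return uri;
            if (XMLConstants.XML_NS_PREFIX.equals(p)) return XMLConstants.XML_NS_URI;
            if (XMLConstants.XMLNS_ATTRIBUTE.equals(p)) return XMLConstants.XMLNS_ATTRIBUTE_NS_URI;
            return XMLConstants.NULL_NS_URI; // unbound prefix
        }

        @Override
        public String getPrefix(String namespaceURI) {
            if (namespaceURI == null) throw new IllegalArgumentException("namespaceURI is null");
            return uri.equals(namespaceURI) ? prefix : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String p = getPrefix(namespaceURI);
            return (p == null ? Collections.<String>emptyList()
                              : Collections.singletonList(p)).iterator();
        }
    }

    // Hypothetical helper: counts matches of a query that may use the "h" prefix.
    static int count(String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Assumption: XHTML input stands in for the HTML parser's output.
            Document document = factory.newDocumentBuilder().parse(new InputSource(
                new StringReader("<html xmlns=\"" + XHTML_NS + "\"><body>"
                    + "<h3>foo</h3><h3>bar</h3><h3>baz</h3></body></html>")));
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(new SinglePrefixContext("h", XHTML_NS));
            return ((NodeList) xPath.evaluate(query, document, XPathConstants.NODESET)).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//h:h3")); // prints 3
    }
}
```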
To reproduce, create an App.java file with the contents as below, and run it with htmlparser's jar available on the classpath. Note that the XPath expression only works when parsing the source with the JDK's DocumentBuilder. When using htmlparser's HtmlDocumentBuilder, it always returns an empty result.