omegahat / XML

The XML package for R

htmlTreeParse Error #9

Closed by nick-ulle 7 years ago

nick-ulle commented 7 years ago

There seems to be a bug with htmlTreeParse() in XML 3.98-1.4 on R 3.3.2. Here's a minimal example:

library(XML)

link = "http://anson.ucdavis.edu/~mueller/cveng13.html"
tree = htmlTreeParse(link)
tree_body = tree$children$html[[2]]
tree_div = getNodeSet(tree_body, path="//div")

The error message is:

Failed to parse QName 'padding-left:'
Failed to parse QName 'padding-bottom:'
Failed to parse QName 'padding-top:'
Comment must not contain '--' (double-hyphen)
Comment must not contain '--' (double-hyphen)
Comment must not contain '--' (double-hyphen)
Error: 1: Failed to parse QName 'padding-left:'
2: Failed to parse QName 'padding-bottom:'
3: Failed to parse QName 'padding-top:'
4: Comment must not contain '--' (double-hyphen)
5: Comment must not contain '--' (double-hyphen)
6: Comment must not contain '--' (double-hyphen)

This error does not occur with htmlParse().

dsidavis commented 7 years ago

By default, htmlTreeParse() returns an R-level representation of the document, and that cannot be used with getNodeSet(). getNodeSet() needs an internal (C-level) document, which you get from htmlParse() or from htmlTreeParse(link, useInternalNodes = TRUE).
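A minimal sketch of the two working alternatives, in case it helps. It parses a small inline HTML string (a hypothetical stand-in for the page in the report, passed with asText = TRUE) so it runs without network access. The variable names are illustrative only.

```r
library(XML)

# Hypothetical inline HTML standing in for the page from the report.
html_text <- paste0(
  "<html><body>",
  "<div style=\"padding-left: 4px\">first</div>",
  "<div>second</div>",
  "</body></html>"
)

# Option 1: htmlParse() returns an internal document that XPath
# functions such as getNodeSet() accept.
doc <- htmlParse(html_text, asText = TRUE)
divs <- getNodeSet(doc, "//div")

# Option 2: htmlTreeParse() with useInternalNodes = TRUE.
doc2 <- htmlTreeParse(html_text, asText = TRUE, useInternalNodes = TRUE)
divs2 <- getNodeSet(doc2, "//div")

# Inspect the matches: number of <div> nodes and their text content.
length(divs)
sapply(divs, xmlValue)
```

With the original example, the same change would presumably be to swap htmlTreeParse(link) for htmlParse(link) and run the XPath query against the returned document.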

nick-ulle commented 7 years ago

Thanks.