Closed barryrowlingson closed 6 months ago
Think you're hitting a limit in libxml2 https://www.suse.com/support/kb/doc/?id=000019477. Not sure if you need to rebuild or if this can be changed at runtime :shrug:
If I try reading with XML::xmlParse, I at least get an error:
> xmlParse("./test.html")
xmlSAX2Characters: huge text nodeExtra content at the end of the document
Error: 1: xmlSAX2Characters: huge text node2: Extra content at the end of the document
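For scripted use, that error can at least be caught, unlike the silent truncation. A minimal sketch (assumes the XML package is installed; the file path is the one from the session above):

```r
# Sketch: XML::xmlParse raises an error on the huge text node, so the
# truncation can be detected with tryCatch rather than passing silently.
doc <- tryCatch(
  XML::xmlParse("./test.html"),
  error = function(e) {
    message("parse failed (likely libxml2 huge-node limit): ",
            conditionMessage(e))
    NULL
  }
)
```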
Looks like the xml2 package is silently failing to report the truncation. I'll file an issue there, if there's not one already...
Seems I have "options"...
> # how big is the input?
> file.size("large.html")
[1] 12000078
> # read it, then write it:
> l = rvest::read_html("large.html")
> xml2::write_html(l, "large-huge.html")
> # check size for truncation
> file.size("large-huge.html")
[1] 10000177
> # HUUUUUGE
> l = rvest::read_html("large.html", options="HUGE")
> xml2::write_html(l, "large-huge.html")
> # not truncated
> file.size("large-huge.html")
[1] 12000185
However, I can't find an option that will make rvest::read_html report the error; the options are probably just passed down to xml2::read_html...
Now filed at https://github.com/r-lib/xml2/issues/440. I suspect there's not going to be much we can do apart from turning HUGE
on by default, but this does seem like pretty unappealing behaviour by libxml2.
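Until that's resolved, a caller-side workaround is a small wrapper that always passes options = "HUGE" and round-trips the document to check for silent truncation. This is only a sketch; read_html_huge is a hypothetical name, not part of rvest, and the 0.9 size ratio is an arbitrary heuristic (re-serialised output never matches the input byte-for-byte):

```r
# Hypothetical wrapper, not part of rvest: parse with HUGE and verify by
# re-serialising that nothing was silently dropped.
read_html_huge <- function(path) {
  doc <- rvest::read_html(path, options = "HUGE")
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))
  xml2::write_html(doc, tmp)
  # Output much smaller than the input suggests truncation anyway.
  if (file.size(tmp) < 0.9 * file.size(path)) {
    warning("document appears truncated despite options = 'HUGE'")
  }
  doc
}
```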
Long lines in HTML are truncated at 10 million characters.

I get output showing truncation at 10000000 chars, and the as.character form has truncated the content and appended a closing script tag to make well-formed HTML from the truncated data. This is all done silently, with no errors or warnings.

The real-world case of this was an HTML file created by the leaflet package, which produces very large single lines of geographic data. The Python packages requests and BeautifulSoup read this file correctly.
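For comparison, here is a minimal sketch of the Python side, using the standard-library html.parser (the backend BeautifulSoup uses with features="html.parser") rather than requests/BeautifulSoup themselves, to show that a non-libxml2 parser imposes no 10-million-character text-node limit:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Sum the length of every text node seen while parsing."""
    def __init__(self):
        super().__init__()
        self.chars = 0

    def handle_data(self, data):
        self.chars += len(data)

# A single text node just over libxml2's 10,000,000-character limit.
big = "x" * 11_000_000
doc = "<html><body><script>" + big + "</script></body></html>"

p = TextCollector()
p.feed(doc)
p.close()
assert p.chars >= 11_000_000  # nothing silently truncated
```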