tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

Long lines truncated at 10,000,000 chars. #399

Closed barryrowlingson closed 6 months ago

barryrowlingson commented 6 months ago

Long lines in HTML are truncated at 10 million characters.

out = "test.html"

### make a char vec of ~12M chars with start and end marker
long = paste0(letters[((1:12000000)%%26)+1],collapse="")
long = paste0("start",long,"end", collapse="")

nchar(long)

### write to a file with some HTML tags.
cat(paste0("<html><body>\n<script type=\"application/json\">",
           long,
           "</script>\n</body></html>\n"), file=out)

### scrape package
library(rvest)

### read the file
page = read_html(out)

### get the nodes by xpath
nodes = html_nodes(page,xpath = '//script[@type="application/json"]')

### get the node content text
text = html_text(nodes[[1]])

### should be about 12 million
nchar(text)

### try this way
chars = as.character(nodes[[1]])

### also should be 12 million
nchar(chars)

### what's at the end?
substr(chars, nchar(chars)-40, nchar(chars))

I get:

> ### should be about 12 million
> nchar(text)
[1] 10000000

> ### try this way
> chars = as.character(nodes[[1]])

> ### also should be 12 million
> nchar(chars)
[1] 10000041

> ### what's at the end?
> substr(chars, nchar(chars)-40, nchar(chars))
[1] "abcdefghijklmnopqrstuvwxyzabcdef</script>"

This shows truncation at 10,000,000 characters. The as.character form has also truncated the content and appended a closing script tag, producing well-formed HTML from truncated data. All of this happens silently, with no errors or warnings.

The real-world case was an HTML file created by the leaflet package, which writes geographic data as very long single lines.

The Python packages requests and BeautifulSoup read this all correctly.

> packageVersion("rvest")
[1] ‘1.0.4’
> version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.1                         
year           2023                        
month          06                          
day            16                          
svn rev        84548                       
language       R                           
version.string R version 4.3.1 (2023-06-16)
nickname       Beagle Scouts               

TimTaylor commented 6 months ago

I think you're hitting a limit in libxml2: https://www.suse.com/support/kb/doc/?id=000019477 . I'm not sure whether you need to rebuild or whether this can be changed at runtime :shrug:

barryrowlingson commented 6 months ago

If I try reading with XML::xmlParse, I at least get an error:

> xmlParse("./test.html")
xmlSAX2Characters: huge text nodeExtra content at the end of the document
Error: 1: xmlSAX2Characters: huge text node2: Extra content at the end of the document

It looks like the xml2 package is silently failing to report the truncation. I'll file an issue there, if there isn't one already.

barryrowlingson commented 6 months ago

Seems I have "options"...

> # how big is the input?
> file.size("large.html")
[1] 12000078

> # read it, then write it:
> l = rvest::read_html("large.html")

> xml2::write_html(l, "large-huge.html")

> # check size for truncation
> file.size("large-huge.html")
[1] 10000177

> # HUUUUUGE
> l = rvest::read_html("large.html", options="HUGE")

> xml2::write_html(l, "large-huge.html")

> # not truncated
> file.size("large-huge.html")
[1] 12000185

However, I can't find an option that makes rvest::read_html report the error, though the options are probably just passed down to xml2::read_html.
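
One caveat with the call above: passing options="HUGE" on its own replaces the default parser options of xml2::read_html (RECOVER, NOERROR, NOBLANKS) rather than adding to them. A minimal sketch that keeps the defaults and adds HUGE is below. The helper name read_html_huge is hypothetical and not part of rvest or xml2.

```r
library(xml2)

# Hypothetical helper (not part of rvest or xml2): parse with the
# documented default options plus HUGE, which lifts libxml2's hard
# limit on the size of a single text node.
read_html_huge <- function(path) {
  read_html(path, options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
}
```

Calling read_html_huge("large.html") should then behave like the default read_html, except that very long text nodes come through untruncated.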

hadley commented 6 months ago

Now filed at https://github.com/r-lib/xml2/issues/440. I suspect there's not going to be much we can do apart from turning HUGE on by default, but this does seem like pretty unappealing behaviour by libxml2.
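
Until something changes upstream, one possible workaround is to detect the truncation yourself by parsing twice and comparing the amount of text recovered. This is only a sketch: read_html_checked is a hypothetical name, it doubles the parsing cost, and it assumes that a shorter text length under the default options means truncation.

```r
library(xml2)

# Hypothetical helper (not part of rvest or xml2): parse once with the
# default options and once with HUGE added, and warn if the default
# parse recovered less text. Always returns the HUGE parse.
read_html_checked <- function(path) {
  defaults <- c("RECOVER", "NOERROR", "NOBLANKS")
  doc_default <- read_html(path, options = defaults)
  doc_huge <- read_html(path, options = c(defaults, "HUGE"))

  n_default <- nchar(xml_text(doc_default))
  n_huge <- nchar(xml_text(doc_huge))

  if (n_default < n_huge) {
    warning("default parse lost ", n_huge - n_default,
            " characters; returning the parse made with HUGE")
  }
  doc_huge
}
```

For files known to be small the second parse is wasted work, so a check like this probably only makes sense for inputs such as the leaflet output described above.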