r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

`read_html()` doesn't report parsing failure on very very long lines #440

Open hadley opened 9 months ago

hadley commented 9 months ago
library(xml2)

path <- tempfile()

long <- paste0("start", strrep("x", 12e6), "end")
nchar(long)
#> [1] 12000008

cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

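# read_html() returns without any warning or error
html <- read_html(path)
# read_xml() reports the failure on the same file
xml <- read_xml(path)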
#> Warning in read_xml.character(path): xmlSAX2Characters: huge text nod [2]
#> Error in read_xml.character(path): Extra content at the end of the document [5]

Created on 2024-02-27 with reprex v2.1.0

From https://github.com/tidyverse/rvest/issues/399
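
Because `read_html()` returns silently here, one way to see what actually happened is to compare the parsed `<script>` text against the string that was written to the file. The snippet below is a minimal sketch of that check, not a fix. It rebuilds the same file as the reprex. It assumes the `<script>` node can be located with the XPath `//script`, and that passing `options = "HUGE"` to `read_xml()` relaxes the parser limit behind the "huge text nod" warning. Neither assumption has been verified against this issue, so the `read_xml()` call is wrapped in `tryCatch()` and no expected output is shown.

```r
library(xml2)

# Rebuild a file like the one in the reprex: a single text node of 12e6+ characters.
path <- tempfile(fileext = ".html")
long <- paste0("start", strrep("x", 12e6), "end")
cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

# read_html() gives no warning, so inspect what was actually parsed.
html <- read_html(path)
parsed <- xml_text(xml_find_first(html, "//script"))

# TRUE only if the whole text node survived parsing.
html_ok <- identical(parsed, long)
c(expected = nchar(long), parsed = nchar(parsed))

# Assumption: "HUGE" relaxes libxml2's hard-coded limits for read_xml().
# tryCatch() keeps the script running if that assumption does not hold.
xml <- tryCatch(read_xml(path, options = "HUGE"), error = function(e) e)
xml_ok <- !inherits(xml, "error") &&
  identical(xml_text(xml_find_first(xml, "//script")), long)

c(html_ok = html_ok, xml_ok = xml_ok)
```

If `html_ok` is `FALSE`, the document returned by `read_html()` is incomplete even though nothing was reported, which is the behaviour this issue describes.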