xml_text can take a long time on a large nodeset when trim is TRUE, and most of that time is spent calling sub twice for each node. In one case, getting the text of a nodeset with about 4.5 million nodes, Rprof showed that calls to sub made up 72% of the nearly two minutes spent in xml_text.
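For context, here is a rough sketch of the per-node pattern that shows up in the profile (a simplified illustration of where the time goes, not the actual xml2 source; xml_text_per_node is just a name for the illustration):

library(xml2)

# Per-node trimming: sub() runs twice inside the loop, so a nodeset of
# n nodes pays for 2 * n regex calls instead of 2 vectorised ones.
xml_text_per_node <- function(x, trim = FALSE) {
  vapply(x, function(node) {
    res <- xml_text(node, trim = FALSE)
    if (isTRUE(trim)) {
      res <- sub("^[[:space:]]+", "", res)
      res <- sub("[[:space:]]+$", "", res)
    }
    res
  }, FUN.VALUE = character(1))
}

Moving the two sub calls outside the loop, so they run once over the whole character vector, is the change proposed below.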
I'll happily make a pull request if you'd like.
Here's a brief demo showing time saved:
library(xml2)
library(microbenchmark)
blob <- paste0(c("<x>", rep("<y> Hi </y>", 1000), "</x>"), collapse = "")
tree <- read_xml(blob)
y_tags <- xml_find_all(tree, "//y")
xml_text_sub_after <- function(x, trim = FALSE) {
  # Get the untrimmed text node by node, then trim the whole character
  # vector with two vectorised sub() calls instead of two per node.
  res <- vapply(x, xml_text, trim = FALSE, FUN.VALUE = character(1))
  if (isTRUE(trim)) {
    res <- sub("^[[:space:] ]+", "", res)
    res <- sub("[[:space:] ]+$", "", res)
  }
  res
}
microbenchmark(
  as_is = xml_text(y_tags, trim = TRUE),
  proposed = xml_text_sub_after(y_tags, trim = TRUE),
  check = "identical"
)
# Unit: milliseconds
#      expr     min      lq      mean   median       uq     max neval
#     as_is 35.2209 35.9433 37.787917 36.64865 38.15000 58.4016   100
#  proposed  6.7719  6.9522  7.625734  7.16715  7.40195 16.2116   100