r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

Trim all xml_text results from a nodeset at the same time #386

Closed WerthPADOH closed 1 year ago

WerthPADOH commented 1 year ago

xml_text can be slow on a large nodeset when trim is TRUE, because most of the time is spent calling sub twice for each node. In one case I extracted the text of a nodeset with about 4.5 million nodes, and Rprof showed that calls to sub accounted for 72% of the nearly two minutes spent in xml_text.

I'll happily make a pull request if you'd like.

Here's a brief demo showing time saved:

library(xml2)
library(microbenchmark)

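# Build a document with 1000 <y> nodes whose text is padded with whitespace
blob <- paste0(c("<x>", rep("<y> Hi </y>", 1000), "</x>"), collapse = "")
tree <- read_xml(blob)
# Select every <y> node as one nodeset
y_tags <- xml_find_all(tree, "//y")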

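# Get the untrimmed text of every node first, then trim the whole
# character vector with two vectorized calls to sub().
xml_text_sub_after <- function(x, trim = FALSE) {
  res <- vapply(x, xml_text, trim = FALSE, FUN.VALUE = character(1))
  if (isTRUE(trim)) {
    # \u00a0 is a non-breaking space, written as an escape so it survives copy and paste
    res <- sub("^[[:space:]\u00a0]+", "", res)
    res <- sub("[[:space:]\u00a0]+$", "", res)
  }
  res
}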

microbenchmark(
  as_is = xml_text(y_tags, trim = TRUE),
  proposed = xml_text_sub_after(y_tags, trim = TRUE),
  check = "identical"
)
# Unit: milliseconds
#      expr     min      lq      mean   median       uq     max neval
#     as_is 35.2209 35.9433 37.787917 36.64865 38.15000 58.4016   100
#  proposed  6.7719  6.9522  7.625734  7.16715  7.40195 16.2116   100
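
The saving comes from how often sub is called, not from the XML handling. Here is a minimal base-R sketch of that difference, assuming a plain character vector as a stand-in for the node text. The names texts and trim_one are hypothetical and only for illustration, and the pattern is simplified to [[:space:]]:

```r
# Hypothetical stand-in for the text of 1000 nodes; no XML parsing involved.
texts <- rep(" Hi ", 1000)

# Strip leading and trailing whitespace; works on one string or a whole vector.
trim_one <- function(s) {
  s <- sub("^[[:space:]]+", "", s)
  sub("[[:space:]]+$", "", s)
}

# Per-element: sub() is called twice for every string (2000 calls in total).
per_element <- vapply(texts, trim_one, FUN.VALUE = character(1), USE.NAMES = FALSE)

# Vectorized: sub() is called twice for the entire vector.
vectorized <- trim_one(texts)

# Both approaches give the same trimmed result.
stopifnot(identical(per_element, vectorized))
```

Wrapping each approach in system.time() should show the same kind of gap as the benchmark, though the exact ratio will depend on the machine and the data.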