r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/
Other
218 stars 83 forks source link

Replace S3 dispatch by C implementation #400

Closed mgirlich closed 10 months ago

mgirlich commented 1 year ago

Closes #386.

library(xml2)

data <- read_xml("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/reed.xml")
many_elts <- data |> 
  xml_contents() |> 
  xml_contents() |>
  xml_contents()

path <- xml2_example("cd_catalog.xml")
xml <- read_xml(path)
cds <- xml |> xml_children()
cd_contents <- cds |> xml_contents()

bench::mark(
  cds = xml_name(cds),
  cd_contents = xml_name(cd_contents),
  many_elts = xml_name(many_elts),
  check = FALSE
)

# DEV
#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           78.6µs   82.5µs   11139.     6.95KB     23.7
#> 2 cd_contents  447.2µs  480.9µs    1940.     1.27KB     21.2
#> 3 many_elts     28.2ms   30.1ms      33.3   65.72KB     42.8

# This PR
#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           3.52µs   3.87µs   214805.    2.66KB    21.5 
#> 2 cd_contents  13.03µs   14.1µs    61534.    1.27KB    18.5 
#> 3 many_elts     1.23ms   1.71ms      572.   65.72KB     8.36

Created on 2023-08-23 with reprex v2.0.2

More benchmarks ``` r library(xml2) data <- read_xml("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/reed.xml") many_elts <- data |> xml_contents() |> xml_contents() |> xml_contents() path <- xml2_example("cd_catalog.xml") xml <- read_xml(path) cds <- xml |> xml_children() cd_contents <- cds |> xml_contents() bench::mark( cds = xml_name(cds), cd_contents = xml_name(cd_contents), many_elts = xml_name(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 66µs 78.6µs 11863. 6.95KB 23.8 #> 2 cd_contents 378.9µs 449.6µs 2069. 1.27KB 23.5 #> 3 many_elts 24.3ms 27.9ms 36.3 65.72KB 40.8 #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 2.85µs 3.07µs 277778. 2.66KB 0 #> 2 cd_contents 13.17µs 14.23µs 62456. 1.27KB 18.7 #> 3 many_elts 1.1ms 1.63ms 616. 65.72KB 8.29 bench::mark( cds = xml_attr(cds, "test"), cd_contents = xml_attr(cd_contents, "test"), many_elts = xml_attr(many_elts, "test"), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 80.7µs 95.9µs 9750. 8.66KB 25.5 #> 2 cd_contents 464.3µs 538.8µs 1779. 1.27KB 23.6 #> 3 many_elts 29.2ms 30.4ms 32.9 65.72KB 132. #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 2.77µs 3.01µs 292658. 3.44KB 29.3 #> 2 cd_contents 8.88µs 9.66µs 87308. 1.27KB 26.2 #> 3 many_elts 814.71µs 1.02ms 889. 65.72KB 10.4 bench::mark( cds = xml_attrs(cds), cd_contents = xml_attrs(cd_contents), many_elts = xml_attrs(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 71.6µs 83.8µs 11029. 6.48KB 25.6 #> 2 cd_contents 401.2µs 475.8µs 1997. 1.27KB 26.8 #> 3 many_elts 24.5ms 25.9ms 39.1 65.72KB 102. #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 3.54µs 4.39µs 194171. 3.02KB 38.8 #> 2 cd_contents 15.28µs 18.69µs 44680. 1.27KB 44.7 #> 3 many_elts 1.07ms 1.42ms 671. 65.72KB 29.9 bench::mark( cds = xml_text(cds), cd_contents = xml_text(cd_contents), many_elts = xml_text(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 94.8µs 110.1µs 7962. 11.24KB 22.1 #> 2 cd_contents 526.7µs 653.4µs 1241. 1.27KB 19.2 #> 3 many_elts 43.7ms 48.6ms 21.3 65.72KB 56.7 #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 11.8µs 12.22µs 74143. 5.14KB 7.42 #> 2 cd_contents 37.14µs 42.75µs 22727. 1.27KB 9.09 #> 3 many_elts 1.94ms 2.36ms 418. 65.72KB 6.30 bench::mark( cds = xml_length(cds), cd_contents = xml_length(cd_contents), many_elts = xml_length(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 75.1µs 85.1µs 8996. 8.69KB 18.9 #> 2 cd_contents 417.2µs 475.9µs 1739. 672B 21.2 #> 3 many_elts 23.6ms 27.2ms 35.7 32.88KB 45.9 #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 1.9µs 2.06µs 418749. 2.41KB 0 #> 2 cd_contents 6.13µs 6.71µs 121050. 672B 36.3 #> 3 many_elts 712.37µs 852.89µs 1075. 32.88KB 14.7 bench::mark( cds = xml_path(cds), cd_contents = xml_path(cd_contents), many_elts = xml_path(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 80.9µs 96.3µs 8871. 6.12KB 18.9 #> 2 cd_contents 499.1µs 614.8µs 1412. 1.27KB 14.6 #> 3 many_elts 56.2ms 67ms 15.6 65.72KB 26.0 #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 15.7µs 18.5µs 52116. 2.46KB 5.21 #> 2 cd_contents 117.7µs 134.9µs 7305. 1.27KB 2.01 #> 3 many_elts 25.6ms 29ms 33.5 65.72KB 0 bench::mark( cds = xml_type(cds), cd_contents = xml_type(cd_contents), many_elts = xml_type(many_elts), check = FALSE ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 45.1µs 53.1µs 15496. 8.1KB 21.1 #> 2 cd_contents 232.4µs 281.6µs 2984. 1.92KB 21.3 #> 3 many_elts 16.1ms 17.8ms 52.9 98.6KB 26.5 #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> #> 1 cds 1.85µs 2.22µs 360872. 5.73KB 36.1 #> 2 cd_contents 6.04µs 8.2µs 104980. 1.92KB 31.5 #> 3 many_elts 702.33µs 870µs 1076. 98.6KB 15.0 ``` Created on 2023-08-30 with [reprex v2.0.2](https://reprex.tidyverse.org)
mgirlich commented 1 year ago

@hadley It would be great to get your opinion of whether this approach is fine for you. It kind of works against the dispatch system. But I'm also not quite sure whether the dispatch is really needed as there are only three classes (, and ) and I don't think the user really needs to extend them.

The performance gains are quite big (around a factor of 20), which would be nice for e.g. the {paws} package (a package to work with AWS).

Overall, the code is pretty redundant so we could also use a macro or some other CPP techniques but I'm not so familiar with C/CPP to know about the potential downsides.

hadley commented 9 months ago

This breaks fhircracker

setClass(
    Class = "fhir_bundle_xml",
    contains = c("fhir_resource_xml", "fhir_bundle"),
    slots = c(next_link = "fhir_url", self_link = "fhir_url"),
    prototype = prototype(xml2::read_xml(x = "<Bundle></Bundle>"))
)
hadley commented 9 months ago

I'm guessing this is because while inherits(x, "xml_node") is still true, Rf_inherits() doesn't handle S4 classes.

Yeah — looks like it https://github.com/wch/r-source/blob/d6dad605b05810cb43f991d292169a2cd436a818/src/include/Rinlinedfuns.h#L774.