tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org

speed up `group_by(replicate)` #492

Closed simonpcouch closed 1 year ago

simonpcouch commented 1 year ago

infer pipelines spend a good bit of time in `group_by()`, and many of those calls are `x %>% group_by(replicate)`. Since we know how the `replicate` column is built, we can tap into dplyr's developer interfaces for `group_by()` to speed that computation up quite a bit.

library(infer)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

gss_gen <- 
   gss %>%
   specify(age ~ college) %>%
   hypothesize(null = "independence") %>%
   generate(reps = 100, type = "permute") %>%
   ungroup()

# note `check = TRUE` checks equality of results
bm <- 
   bench::mark(
      old = group_by(gss_gen, replicate),
      new = infer:::group_by_replicate(gss_gen, reps = 100, n = 500),
      check = TRUE
   )

bm
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old          1.03ms    1.1ms      872.    1.01MB     24.7
#> 2 new          94.5µs  115.9µs     8504.  404.34KB     71.3

# `old` is ___ times slower than `new`
as.numeric(bm$median[[1]]) / as.numeric(bm$median[[2]])
#> [1] 9.461464

Created on 2023-04-10 with reprex v2.0.2
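
For readers curious how such a shortcut can work, here is a minimal sketch, not the package's actual `group_by_replicate()` implementation. It assumes the rows are ordered by `replicate`, that `replicate` is `rep(1:reps, each = n)`, and that every replicate has exactly `n` rows. Under those assumptions the grouping structure can be written down directly and handed to `dplyr::new_grouped_df()` instead of being rediscovered by `group_by()`. The helper name `group_by_replicate_sketch()` and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical helper: NOT infer's internal code, only an illustration.
# Assumes `x$replicate` is rep(1:reps, each = n), i.e. rows are already
# ordered by replicate and each replicate has exactly `n` rows.
group_by_replicate_sketch <- function(x, reps, n) {
  reps <- as.integer(reps)
  n <- as.integer(n)

  # One row per group: the key column plus `.rows`, a list of the
  # 1-based row indices belonging to each group.
  groups <- dplyr::tibble(
    replicate = seq_len(reps),
    .rows = vctrs::new_list_of(
      lapply(seq_len(reps), function(i) seq.int((i - 1L) * n + 1L, i * n)),
      ptype = integer()
    )
  )
  # `group_by()` records this attribute on its grouping data as well
  attr(groups, ".drop") <- TRUE

  dplyr::new_grouped_df(x, groups = groups)
}

# Toy data standing in for the output of `generate()`
reps <- 3L
n <- 4L
df <- tibble(
  replicate = rep(seq_len(reps), each = n),
  value = seq_len(reps * n)
)

old <- group_by(df, replicate)
new <- group_by_replicate_sketch(df, reps = reps, n = n)

# Both groupings should lead to the same per-replicate summaries
all.equal(
  as.data.frame(summarise(old, m = mean(value))),
  as.data.frame(summarise(new, m = mean(value)))
)
```

The shortcut is only valid while the assumptions hold: if rows were shuffled or replicates had unequal sizes, the hand-built `.rows` indices would silently describe the wrong groups, so a real implementation would want to guard against that.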

To put these changes in context, here are the longer-running examples from `calculate()`, benchmarked on the main development branch:

library(infer)

bench::mark(
   mean = 
      gss %>%
      specify(response = hours) %>%
      hypothesize(null = "point", mu = 40) %>%
      generate(reps = 200, type = "bootstrap") %>%
      calculate(stat = "mean"),
   diff_in_means = 
      gss %>%
      specify(age ~ college) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 200, type = "permute") %>%
      calculate("diff in means", order = c("degree", "no degree")),
   check = FALSE
)
#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mean            18.2ms   21.1ms     48.6     16.4MB     25.3
#> 2 diff_in_means  481.3ms  484.6ms      2.06    13.3MB     27.9

With this PR:

#> # A tibble: 2 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mean            14.6ms   16.4ms     59.0     14.2MB     27.5
#> 2 diff_in_means  445.9ms  448.4ms      2.23    13.6MB     27.9

Created on 2023-04-10 with reprex v2.0.2
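
As a rough way to read the two tables together, the median timings can be turned into end-to-end speedup ratios. The snippet below simply re-enters the medians reported above by hand (the data frame and its column names are made up for illustration). It works out to roughly 1.29x for `mean` and 1.08x for `diff_in_means`, much smaller than the isolated `group_by()` speedup because grouping is only one part of each pipeline.

```r
# Median timings in milliseconds, copied from the benchmark tables above
medians_ms <- data.frame(
  expression = c("mean", "diff_in_means"),
  main = c(21.1, 484.6),  # main development branch
  pr = c(16.4, 448.4)     # with this PR
)

# Ratio > 1 means the PR is faster than main
medians_ms$speedup <- medians_ms$main / medians_ms$pr
medians_ms
```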
