infer pipelines spend a good bit of time in `group_by()`, and many of those calls take the form `x %>% group_by(replicate)`. Since we know how that `replicate` column is built, we can tap into dplyr's developer interfaces for `group_by()` to speed that computation up quite a bit.
library(infer)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
gss_gen <-
  gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  ungroup()

# note `check = TRUE` checks equality of results
bm <-
  bench::mark(
    old = group_by(gss_gen, replicate),
    new = infer:::group_by_replicate(gss_gen, reps = 100, n = 500),
    check = TRUE
  )
bm
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old          1.03ms    1.1ms      872.    1.01MB     24.7
#> 2 new          94.5µs  115.9µs     8504.  404.34KB     71.3
# `old` is ___ times slower than `new`
as.numeric(bm$median[[1]]) / as.numeric(bm$median[[2]])
#> [1] 9.461464
Created on 2023-04-10 with reprex v2.0.2
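The core of the trick looks roughly like the sketch below. This is an illustration of the approach rather than the exact implementation behind `infer:::group_by_replicate()` (the helper name and arguments come from the benchmark above; the body is an assumption): because `generate()` lays the replicates out back to back, `reps` groups of `n` rows each, the row indices for every group are known up front and can be handed straight to `dplyr::new_grouped_df()` instead of being re-derived by `group_by()`.

```r
# Sketch only: assumes rows are ordered by `replicate`, with exactly `n`
# rows per replicate, as produced by `generate()`.
group_by_replicate_sketch <- function(x, reps, n) {
  # rows 1:n belong to the first replicate, (n + 1):(2 * n) to the second, ...
  rows <- lapply(seq_len(reps), function(i) seq.int((i - 1L) * n + 1L, i * n))

  # build the `group_data()`-style tibble by hand: one row per group,
  # holding the group key plus a `.rows` list column of row indices
  groups <- tibble::tibble(replicate = unique(x$replicate))
  groups$.rows <- vctrs::new_list_of(rows, ptype = integer())

  # dplyr's developer constructor for grouped data frames, which skips the
  # general-purpose grouping machinery that `group_by()` would otherwise run
  dplyr::new_grouped_df(x, groups)
}

group_by_replicate_sketch(gss_gen, reps = 100, n = 500)
```

Under that assumption, the result is a grouped tibble equivalent to `group_by(gss_gen, replicate)`, which is what the `check = TRUE` in the benchmark above verifies for the real helper.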
Putting these changes in context using the longer-running examples from `calculate()`, with `main` dev:

With this PR:
Created on 2023-04-10 with reprex v2.0.2
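For anyone wanting to rerun that comparison locally, a benchmark along the following lines exercises the longer-running `calculate()` path. It is a hypothetical sketch (the statistic, `reps`, and iteration count are my own choices, not the code behind the timings above); run it once on `main` and once with this branch installed:

```r
library(infer)
library(dplyr)

# Hypothetical end-to-end permutation pipeline; a good chunk of its time is
# spent in the grouping work that this PR speeds up.
bench::mark(
  gss %>%
    specify(age ~ college) %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in means", order = c("degree", "no degree")),
  iterations = 10
)
```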