tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org

speed up `make_replicate_tbl()` #490

Closed simonpcouch closed 1 year ago

simonpcouch commented 1 year ago

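`make_replicate_tbl()` is a helper that backs `rep_slice_sample()` and `generate(..., type = "bootstrap")`. This PR gives about a 2x speedup in the example below: median time drops from 24.3ms to 12.9ms, and memory allocated drops from 33.1MB to 17.1MB.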
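For readers unfamiliar with what a replicate table is, here is a minimal base R sketch of the idea. It is not infer's implementation, which this thread does not show. The function name `bootstrap_replicates()` and the toy `hours` data are hypothetical and only illustrate drawing all bootstrap row indices in one vectorized call and labelling each resample with a `replicate` id.

```r
# Hypothetical helper for illustration only; NOT infer's make_replicate_tbl().
# Builds a long table holding `reps` bootstrap resamples of `data`.
bootstrap_replicates <- function(data, reps) {
  n <- nrow(data)
  # One vectorized draw of reps * n row indices, sampled with replacement
  idx <- sample.int(n, size = reps * n, replace = TRUE)
  out <- data[idx, , drop = FALSE]
  rownames(out) <- NULL
  # Label each consecutive block of n rows with its replicate number
  data.frame(replicate = rep(seq_len(reps), each = n), out)
}

set.seed(1)
toy <- data.frame(hours = c(20, 40, 40, 50, 60))  # made-up data
boot <- bootstrap_replicates(toy, reps = 1000)

# One statistic per replicate, analogous to calculate(stat = "mean")
stats <- tapply(boot$hours, boot$replicate, mean)
```

Indexing the data once, rather than once per replicate, is a common way to cut allocations in this kind of helper. Whether that is what this PR changes is an assumption.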

With main dev:

library(infer)

gss_hyp <- 
   gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40)

bench::mark(
   generate = generate(gss_hyp, reps = 1000, type = "bootstrap")
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 generate     24.3ms   24.3ms      41.2    33.1MB     864.

With this PR:

library(infer)

gss_hyp <- 
   gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40)

bench::mark(
   generate = generate(gss_hyp, reps = 1000, type = "bootstrap")
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 generate     12.1ms   12.9ms      77.4    17.1MB     141.

The results are indeed the same under a fixed seed. With main dev:

library(infer)

set.seed(1)

gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "mean")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1,000 × 2
#>    replicate  stat
#>        <int> <dbl>
#>  1         1  40.0
#>  2         2  38.9
#>  3         3  40.5
#>  4         4  40.3
#>  5         5  40.1
#>  6         6  40.6
#>  7         7  40.2
#>  8         8  40.1
#>  9         9  40.4
#> 10        10  39.9
#> # ℹ 990 more rows

With this PR:

library(infer)

set.seed(1)

gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40) %>%
   generate(reps = 1000, type = "bootstrap") %>%
   calculate(stat = "mean")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1,000 × 2
#>    replicate  stat
#>        <int> <dbl>
#>  1         1  40.0
#>  2         2  38.9
#>  3         3  40.5
#>  4         4  40.3
#>  5         5  40.1
#>  6         6  40.6
#>  7         7  40.2
#>  8         8  40.1
#>  9         9  40.4
#> 10        10  39.9
#> # ℹ 990 more rows

Created on 2023-04-10 with reprex v2.0.2
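The printed output above only shows the first ten rows of each run. One way to compare every replicate is to save the result under one installed version and check it with `identical()` under the other. The sketch below shows that pattern in base R, with a hypothetical stand-in function `resample_means()` in place of the infer pipeline so that it runs without either package version installed.

```r
# Hypothetical stand-in for the infer pipeline: 1000 bootstrap means of toy data.
resample_means <- function() {
  x <- c(20, 40, 40, 50, 60)  # made-up data
  replicate(1000, mean(sample(x, replace = TRUE)))
}

# Step 1 (would run with the first version installed): fix the seed, save the result.
path <- tempfile(fileext = ".rds")
set.seed(1)
saveRDS(resample_means(), path)

# Step 2 (would run with the second version installed): same seed, compare all values.
set.seed(1)
current <- resample_means()
same <- identical(readRDS(path), current)
same
```

`identical()` is strict about type and attributes. `all.equal()` is the usual alternative when small floating-point differences are acceptable.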
