larmarange / broom.helpers

A set of functions to facilitate manipulation of tibbles produced by broom
https://larmarange.github.io/broom.helpers/
GNU General Public License v3.0

Execution time #254

Closed lucasxteixeira closed 2 months ago

lucasxteixeira commented 2 months ago

Hi,

I have been experiencing issues with execution time when using the tidy_plus_plus function. After investigating, I found that some steps become significantly slower as the input data grows. While I expected this for add_n, I did not anticipate it for the other steps.

For example, tidying a simple linear model with 1 million data points takes approximately 0.8 seconds with the regular broom::tidy, whereas tidy_plus_plus (without add_n) takes around 3 seconds. Although this difference might not seem substantial, my real use case involves a survival model with 14 million data points, which makes the increased execution time impractical.

The steps that notably increase execution time in my testing are:

Reproducible Example:

set.seed(42)
size <- 1000000
df1 <- data.frame(
  y = rnorm(size),
  x1 = rnorm(size),
  x2 = sample(c(TRUE, FALSE), size, replace = TRUE),
  x3 = factor(sample(c("A", "B", "C"), size, replace = TRUE))
)
m1 <- lm(y ~ x1 + x2 + x3, data = df1)
tidy_res1 <- broom::tidy(m1, conf.int = TRUE)

size <- 100000
df2 <- data.frame(
  y = rnorm(size),
  x1 = rnorm(size),
  x2 = sample(c(TRUE, FALSE), size, replace = TRUE),
  x3 = factor(sample(c("A", "B", "C"), size, replace = TRUE))
)
m2 <- lm(y ~ x1 + x2 + x3, data = df2)
tidy_res2 <- broom::tidy(m2, conf.int = TRUE)

microbenchmark::microbenchmark(
  broom.helpers::tidy_identify_variables(tidy_res1, model = m1),
  broom.helpers::tidy_identify_variables(tidy_res2, model = m2),
  times = 1000
)

larmarange commented 2 months ago

My first guess is that in many places we need information stored in the model.frame or the model.matrix. While the model.frame is usually stored within the model results, this is not the case for the model.matrix. Therefore, it may be relevant to copy and attach the model.matrix (at least temporarily) between the different steps. Needs to be tested.
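To make the idea concrete, here is a minimal sketch of attaching the model matrix to the model object as an attribute so that later steps can reuse it instead of recomputing it. The helper names `attach_model_matrix()` and `get_model_matrix_cached()` and the attribute name `cached_model_matrix` are assumptions made for illustration; they are not the functions used in broom.helpers.

```r
# Hypothetical helpers for illustration only; not part of broom.helpers.

# Compute the design matrix once and store it on the model object.
attach_model_matrix <- function(model) {
  attr(model, "cached_model_matrix") <- stats::model.matrix(model)
  model
}

# Return the stored matrix if present, otherwise recompute it.
get_model_matrix_cached <- function(model) {
  cached <- attr(model, "cached_model_matrix", exact = TRUE)
  if (!is.null(cached)) {
    return(cached)
  }
  stats::model.matrix(model)
}

set.seed(42)
n <- 1000
df <- data.frame(
  y = rnorm(n),
  x1 = rnorm(n),
  x3 = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
m <- lm(y ~ x1 + x3, data = df)
m_cached <- attach_model_matrix(m)

mm_fresh <- get_model_matrix_cached(m)         # recomputed on every call
mm_reused <- get_model_matrix_cached(m_cached) # read from the attribute
identical(mm_fresh, mm_reused)
```

The trade-off in this sketch is memory: the cached matrix travels with the model object, which is why an option to switch the caching off can be useful for very large models.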

ddsjoberg commented 2 months ago

I've never used it personally, but perhaps the memoise package could be useful. It caches results, so the calculation is only performed the first time and each subsequent call grabs the result from the cache.

https://memoise.r-lib.org/
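The caching idea behind this suggestion can be sketched with a hand-rolled closure in base R, so that no extra package is needed to try it. `memoise_by_key()` and `slow_model_matrix()` are hypothetical names introduced for this example; with the memoise package installed, `memoise::memoise()` wraps a function in a similar way, keyed on the function arguments rather than on an explicit string.

```r
# Hypothetical, hand-rolled cache for illustration; not the memoise API.
memoise_by_key <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(key, ...) {
    # Run fn() only the first time a given key is seen.
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Count how often the expensive computation really runs.
n_calls <- 0
slow_model_matrix <- function(model) {
  n_calls <<- n_calls + 1
  stats::model.matrix(model)
}
cached_model_matrix <- memoise_by_key(slow_model_matrix)

m <- lm(mpg ~ wt + factor(cyl), data = mtcars)
mm1 <- cached_model_matrix("m", m) # computed
mm2 <- cached_model_matrix("m", m) # served from the cache
n_calls
```

An explicit string key is used here because a cache keyed on the arguments has to hash the model object itself, which may not be cheap for a model fitted on millions of rows.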

larmarange commented 2 months ago

Please have a look at #255, which implements some caching of the model frame and model matrix, attached to the model object. The argument model_matrix_attr = FALSE allows you to deactivate this caching.

@ddsjoberg I didn't use memoise to avoid additional dependencies. Do you think it would be better to rely on it?

See below a quick benchmark before and after. @lucasxteixeira using this new version and skipping the computation of the number of observations per modality (i.e. setting add_n = FALSE), you can save some time (almost divided by 4 in this example).

library(broom.helpers)

set.seed(42)
size <- 5000000
df <- data.frame(
  y = rnorm(size),
  x1 = rnorm(size),
  x2 = sample(c(TRUE, FALSE), size, replace = TRUE),
  x3 = factor(sample(c("A", "B", "C"), size, replace = TRUE))
)
m <- lm(y ~ x1 + x2 + x3, data = df)

# without caching model matrix

s0 <- m |>
  tidy_and_attach(model_matrix_attr = FALSE)
s1 <- s0 |> 
  tidy_identify_variables()
s2 <- s1 |> 
  tidy_add_contrasts()
s3 <- s2 |> 
  tidy_add_reference_rows()
s4 <- s3 |> 
  tidy_add_estimate_to_reference_rows()
s5 <- s4 |> 
  tidy_add_term_labels()
s6 <- s5 |> 
  tidy_add_header_rows()
s7 <- s6 |> 
  tidy_add_n()

b <- microbenchmark::microbenchmark(
  m |> tidy_and_attach(),
  s0 |> tidy_identify_variables(),
  s1 |> tidy_add_contrasts(),
  s2 |> tidy_add_reference_rows(),
  s3 |> tidy_add_estimate_to_reference_rows(),
  s4 |> tidy_add_term_labels(),
  s5 |> tidy_add_header_rows(),
  s6 |> tidy_add_n(),
  times = 5
)
b
#> Unit: milliseconds
#>                                     expr       min        lq       mean
#>                       tidy_and_attach(m) 1308.1122 1342.4657 1511.52028
#>              tidy_identify_variables(s0)  891.9426  892.2176 1013.81794
#>                   tidy_add_contrasts(s1)  906.9803  950.8952 1199.12334
#>              tidy_add_reference_rows(s2) 1740.8123 1914.9015 2304.18500
#>  tidy_add_estimate_to_reference_rows(s3)    2.4704    2.5295    2.83704
#>                 tidy_add_term_labels(s4) 1770.3298 1904.9071 2170.40976
#>                 tidy_add_header_rows(s5)   18.3213   21.1508   21.79842
#>                           tidy_add_n(s6) 4908.2289 5417.0192 5593.61400
#>     median        uq       max neval
#>  1369.7221 1452.1779 2085.1235     5
#>  1003.9495 1091.5755 1189.4045     5
#>  1124.5302 1486.5224 1526.6886     5
#>  2298.6875 2490.7164 3075.8073     5
#>     2.9398    3.0409    3.2046     5
#>  2081.1680 2132.5292 2963.1147     5
#>    21.2039   21.7075   26.6086     5
#>  5766.6404 5923.1059 5953.0756     5

# with caching model matrix

s0 <- m |>
  tidy_and_attach()
s1 <- s0 |> 
  tidy_identify_variables()
s2 <- s1 |> 
  tidy_add_contrasts()
s3 <- s2 |> 
  tidy_add_reference_rows()
s4 <- s3 |> 
  tidy_add_estimate_to_reference_rows()
s5 <- s4 |> 
  tidy_add_term_labels()
s6 <- s5 |> 
  tidy_add_header_rows()
s7 <- s6 |> 
  tidy_add_n()

b <- microbenchmark::microbenchmark(
  m |> tidy_and_attach(),
  s0 |> tidy_identify_variables(),
  s1 |> tidy_add_contrasts(),
  s2 |> tidy_add_reference_rows(),
  s3 |> tidy_add_estimate_to_reference_rows(),
  s4 |> tidy_add_term_labels(),
  s5 |> tidy_add_header_rows(),
  s6 |> tidy_add_n(),
  times = 5
)
b
#> Unit: milliseconds
#>                                     expr       min        lq       mean
#>                       tidy_and_attach(m) 1183.0690 1260.9537 1618.52904
#>              tidy_identify_variables(s0)   28.6777   28.7928   37.62974
#>                   tidy_add_contrasts(s1)   35.9631   36.2232   51.41974
#>              tidy_add_reference_rows(s2)  202.8618  216.1722  247.23770
#>  tidy_add_estimate_to_reference_rows(s3)    2.0402    2.0497    2.89266
#>                 tidy_add_term_labels(s4)  115.9453  133.5421  144.57326
#>                 tidy_add_header_rows(s5)   17.4916   21.6246   24.74484
#>                           tidy_add_n(s6) 1734.7273 2098.8633 2124.85670
#>     median        uq       max neval
#>  1736.3493 1894.6854 2017.5878     5
#>    35.5844   47.5241   47.5697     5
#>    44.9223   64.5268   75.4633     5
#>   226.8496  294.4951  295.8098     5
#>     2.5728    3.2854    4.5152     5
#>   138.2370  139.0601  196.0818     5
#>    26.8611   28.5671   29.1798     5
#>  2193.1916 2199.2863 2398.2150     5

# overall gain

microbenchmark::microbenchmark(
  tidy_plus_plus(m, model_matrix_attr = FALSE),
  tidy_plus_plus(m),
  tidy_plus_plus(m, add_n = FALSE),
  tidy_plus_plus(m, add_n = FALSE, model_matrix_attr = FALSE),
  times = 5
)
#> Unit: seconds
#>                                                         expr       min
#>                 tidy_plus_plus(m, model_matrix_attr = FALSE) 11.806868
#>                                            tidy_plus_plus(m)  5.100999
#>                             tidy_plus_plus(m, add_n = FALSE)  2.494101
#>  tidy_plus_plus(m, add_n = FALSE, model_matrix_attr = FALSE)  6.815652
#>         lq      mean    median        uq       max neval
#>  12.237099 12.544947 12.771944 12.951536 12.957286     5
#>   5.302496  7.109005  5.343877  5.869818 13.927835     5
#>   2.565271  2.959569  2.737820  2.961719  4.038936     5
#>   7.302845 12.897389 13.297690 18.419446 18.651313     5

Created on 2024-07-01 with reprex v2.1.0
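As a rough reading aid, the overall gain can be condensed into speed-up ratios computed from the median column of the last benchmark. The vector names below are labels chosen for this sketch, and with only 5 evaluations per expression the ratios should be treated as indicative rather than precise.

```r
# Median timings in seconds, copied from the tidy_plus_plus() benchmark above.
medians <- c(
  no_cache_with_n = 12.771944, # model_matrix_attr = FALSE
  cache_with_n    = 5.343877,  # defaults
  cache_no_n      = 2.737820,  # add_n = FALSE
  no_cache_no_n   = 13.297690  # add_n = FALSE, model_matrix_attr = FALSE
)

# Speed-up relative to the run without caching and with add_n.
speedup <- medians[["no_cache_with_n"]] / medians
round(speedup, 1)
```

On these medians, caching alone gives a factor of about 2.4, and caching combined with add_n = FALSE gives about 4.7. The fourth configuration shows a much wider spread between min and max than the others, so its median is probably dominated by noise.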

lucasxteixeira commented 2 months ago

Thank you! It was a very fast solution. The performance improved quite a bit.