vincentarelbundock / modelsummary

Beautiful and customizable model summaries in R.
http://modelsummary.com
Other
913 stars 76 forks source link

Duplicate models in parallel mode: computation needs several minutes to compute #512

Closed zauster closed 2 years ago

zauster commented 2 years ago

Hey,

if I include one model twice in the list of models to be summarised and try to parallelise (using plan(multisession)) the computation, it takes several minutes until the output is printed.

REPREX:

library(data.table)
library(modelsummary)
library(future)

N <- 5e4
dprob <- 0.004
dnum <- 10
d_mat <- sample(c(0L, 1L), size = N * dnum, replace = TRUE,
                prob = c(1 - dprob, dprob)) |>
  matrix(ncol = dnum) |>
  as.data.table()
dt_tmp <- data.table(
  depvar = rpois(n = N, lambda = 4),
  value = rnorm(n = N),
  d_mat
)
all_dummies <- paste0("V", 1:dnum)

## 1. very small model
est_formula <- reformulate(
  termlabels = c("value"),
  response = "depvar")
res_1 <- glm(est_formula, family = poisson(), data = dt_tmp)

## 2. model with some dummies
est_formula <- reformulate(
  termlabels = c("value", all_dummies),
  response = "depvar")
res_2 <- glm(est_formula, family = poisson(), data = dt_tmp)

##------------------------------------------------------------
## compare sequential and parallel (multisession)
start_sequential <- Sys.time()
plan(sequential)
modelsummary(list(res_1, res_2), output = "markdown",
             statistic = NULL, vcov = NULL, conf_level = NULL,
             gof_omit = "AIC|BIC|RMSE")
(duration_sequential <- difftime(Sys.time(), start_sequential))
## finished in ~8 seconds

## do it in parallel
start_multisession <- Sys.time()
plan(multisession, workers = 2)
modelsummary(list(res_1, res_2), output = "markdown",
             statistic = NULL, vcov = NULL, conf_level = NULL,
             gof_omit = "AIC|BIC|RMSE")
(duration_multisession2 <- difftime(Sys.time(), start_multisession))
## finished in ~10 seconds

## include model #2 twice
start_multisession <- Sys.time()
plan(multisession, workers = 3)
modelsummary(list(res_1, res_2, res_2), output = "markdown",
             statistic = NULL, vcov = NULL, conf_level = NULL,
             gof_omit = "AIC|BIC|RMSE")
(duration_multisession3 <- difftime(Sys.time(), start_multisession))
## finished in ~3 minutes

sessionInfo:

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_zenp-r0.3.20.so

locale:
 [1] LC_CTYPE=de_AT.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=de_AT.UTF-8    
 [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=de_AT.UTF-8   
 [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kableExtra_1.3.4         future_1.26.1            modelsummary_0.11.1.9000
[4] here_1.0.1               data.table_1.14.2        conflicted_1.1.0        

loaded via a namespace (and not attached):
 [1] highr_0.9          compiler_4.2.0     parameters_0.18.1  tools_4.2.0       
 [5] digest_0.6.29      viridisLite_0.4.0  memoise_2.0.1      evaluate_0.15     
 [9] lifecycle_1.0.1    checkmate_2.1.0    rlang_1.0.2        cli_3.3.0         
[13] rstudioapi_0.13    parallel_4.2.0     xfun_0.31          fastmap_1.1.0     
[17] stringr_1.4.0      bayestestR_0.12.1  httr_1.4.3         knitr_1.39        
[21] xml2_1.3.3         systemfonts_1.0.4  generics_0.1.2     globals_0.15.1    
[25] webshot_0.5.3      rprojroot_2.0.3    svglite_2.1.0      glue_1.6.2        
[29] performance_0.9.1  listenv_0.8.0      R6_2.5.1           future.apply_1.9.0
[33] parallelly_1.32.0  rmarkdown_2.14     magrittr_2.0.3     scales_1.2.0      
[37] tables_0.9.6       backports_1.4.1    codetools_0.2-18   htmltools_0.5.2   
[41] MASS_7.3-57        datawizard_0.4.1   insight_0.17.1     rvest_1.0.2       
[45] colorspace_2.0-3   stringi_1.7.6      munsell_0.5.0      cachem_1.0.6 
vincentarelbundock commented 2 years ago

Thanks a lot for the report.

I can't look into this right away, but will do so as soon as I can.

Note to self: Bengtsson presentation about profiling future parallel code https://www.jottr.org/2022/06/23/future-user2022-slides/

vincentarelbundock commented 2 years ago

There seems to be a lot of “send data to nodes” overhead, which I think I was able to limit a bit. But beyond that, I still don’t get it.

As an experiment, I created a new branch which accepts a mc.cores argument and parallelizes using parallel::mclapply(). It is much faster than single core.

remotes::install_github('vincentarelbundock/modelsummary@parallel')

Relevant code here: https://github.com/vincentarelbundock/modelsummary/blob/parallel/R/modelsummary.R

When I use future, things are very slow the first time around, but they are fast the second time I call the function.

Any ideas what could explain this?

library(tictoc)
library(insight)
library(modelsummary)
library(future.apply)

mod <- list(
    download_model("brms_mixed_2"),
    download_model("brms_mixed_3"),
    download_model("brms_mixed_5"))
mod[[4]] <- mod[[5]] <- mod[[6]] <- mod[[7]] <- mod[[1]]

# sequential is slow
tic()
tab <- modelsummary(
    mod,
    output = "data.frame",
    statistic = "conf.int")
toc()
#> 53.589 sec elapsed

# {parallel} is fast
tic()
tab <- modelsummary(
    mod,
    output = "data.frame",
    statistic = "conf.int",
    mc.cores = 7)
toc()
#> 17.353 sec elapsed

# {future}: 1st time is slow
plan(multisession)
tic()
tab <- modelsummary(
    mod,
    output = "data.frame",
    statistic = "conf.int")
toc()
#> 45.663 sec elapsed

# {future}: 2nd time is fast
tic()
tab <- modelsummary(
    mod,
    output = "data.frame",
    statistic = "conf.int")
toc()
#> 18.158 sec elapsed
vincentarelbundock commented 2 years ago

Alright, I just merged the parallel branch in "main". You can now add mc.cores=7 (or whatever number of cores), and modelsummary will invoke parallel::mcapply under the hood. On my linux machine, this scales not-quite-linearly. It should work on mac, and probably sometimes on Windows too.

future support is still there, but there may be some just-in-time compilation going on that slows down the first call.

I tried your example on my recent laptop and it went from roughly 6 to 4 to 2 seconds. Let me know if you see the same thing with the current version from Github.

vincentarelbundock commented 2 years ago

Closing issues to get a better sense of what TODOs are left on my plate, but feel free to re-open or keep commenting if this doesn't work for you.

HenrikBengtsson commented 2 years ago

plan(multicore) uses the same forked parallelization framework as mclapply(). What do you get if you use:

plan(multicore, workers=7)

in your comparison?

PS. Forked processing is not available on MS Windows, so such users will end up running in sequential mode.

vincentarelbundock commented 2 years ago

It works!

The first time I call my future-parallel function with "multicore" is as fast as parallel, and as fast as the the 2nd time I run the function.

I'm still kind of curious why the 1st and 2nd run had such different timings with "multisession"...

FWIW, I'm running Ubuntu inside a WSL2-Windows.

Thanks for looking!

zauster commented 2 years ago

I guess (but I could be wrong) that maybe there is some transferring of data to the "worker" R sessions the first time, which needs not be fine the second time?

Also, since/if there is no easy clear solution to this, why not simply remove all duplicates from the modellist, tell the user about it and continue? Do you know any real world examples where one would want to include one model twice (in one table)?

vincentarelbundock commented 2 years ago

Do you know any real world examples where one would want to include one model twice (in one table)?

Showing the same model with different types of sandwich standard errors would be one example.

But my sense is this "problem" is not actually related to duplicated models. It is just slow on first run regardless of what models the models list contains.

HenrikBengtsson commented 2 years ago

It's most likely because the workers need to load some packages during the first call. Since 'multisession' uses persistent PSOCK workers, the packages are already loaded is succeeding calls. If you preload some the heavy package up-front, e.g.

void <- future_lapply(1:nbrOfWorkers(), FUN = function(x) {
  loadNamespace("brms")
  loadNamespace("rstan")
})

you'll see a smaller different between the first and the second call.

BTW, you're missing to declare use of RNG in your parallelization. That is, specify future.seed = TRUE to get rid of the RNG warnings;

1: UNRELIABLE VALUE: One of the ‘future.apply’ iterations (‘future_lapply-1’) unexpectedly generated
 random numbers without declaring so. There is a risk that those random numbers are not statistically
 sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures
that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable
this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore". 
vincentarelbundock commented 2 years ago

Thanks a lot for these insights. Things are much clearer now for me, and I have update the code and documentation to reflect those insights. I have also set future.seed = TRUE as suggested.

Thanks!