Closed: vincentarelbundock closed this issue 2 years ago.
@ajnafa
We have patched the upstream packages. Could you please try installing the development versions of:

easystats/insight
easystats/bayestestR
vincentarelbundock/modelsummary

and then running:

modelsummary(mod, test = NULL, diagnostic = NULL)

Thanks!
Much better performance this time.
# Install development versions of the packages
remotes::install_github("easystats/insight")
remotes::install_github("easystats/bayestestR")
remotes::install_github("vincentarelbundock/modelsummary")

# Restart the RStudio session
rstudioapi::restartSession()

# Load the libraries
library(brms)
library(tidyverse)
library(modelsummary)

# Load the models as a list
soc_models_main <- map(
  .x = list.files(
    soc_models_dir,
    pattern = ".*_Final.rds",
    full.names = TRUE
  ),
  ~ read_rds(.x)
)
# I'm testing this on a list of eight models
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  test = NULL,
  diagnostic = NULL,
  metrics = "none"
))
#   user  system elapsed
#  25.83    0.47   26.29
Great news!
Now run this first and call modelsummary() again. It should compute in parallel:
library(future.apply)
plan("multisession")
@saudiwin may also be interested to know that the upstream changes @strengejacke and I just pushed to bayestestR and insight could bring a 10x speed boost to (some) brms models.
Edit: more like 2x. But still nice!
Here are the times for it with test and diagnostic not set to NULL.
# With diagnostics
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  test = NULL,
  metrics = "none"
))
#   user  system elapsed
# 204.47    0.39  204.86
# With test and diagnostics
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  metrics = "none"
))
#   user  system elapsed
# 223.83    1.00  225.17
So, performance is substantially worse with parallel computation enabled, at least under Windows.
# Trying parallel computation (12-core Ryzen 9 5900X)
library(future.apply)
plan(multisession(workers = 12))
options(future.globals.maxSize = 1024 * 10e10)

# I'm testing this on a list of eight models
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  test = NULL,
  diagnostic = NULL,
  metrics = "none"
))
#   user  system elapsed
#   3.41   35.36  118.52
That said, a subsequent run only takes half the elapsed time.
#   user  system elapsed
#   3.00   38.02   59.07
This deepens my suspicion that the performance problem is mainly an I/O issue: the bulk of the time is spent initially reading in the posterior draws (30,000 per model in this case) rather than on the actual computation of the quantities. We see similar results if we run the code without diagnostic = NULL, both with and without restarting the R session after the initial run.
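One rough way to check the I/O hypothesis would be to time the read step separately from draw extraction. This is only a sketch: `soc_models_dir` is assumed to be defined as in the earlier snippet, and `as_draws_df()` stands in for whatever extraction the summary functions actually do.

```r
# Time the pure file I/O separately from extracting the posterior draws.
# Assumes `soc_models_dir` is defined as in the earlier snippet.
files <- list.files(soc_models_dir, pattern = ".*_Final.rds", full.names = TRUE)
system.time(models <- lapply(files, readRDS))           # pure I/O
system.time(draws <- lapply(models, brms::as_draws_df)) # draw extraction
```

If the first timing dominates, the disk (or RDS deserialization) is the bottleneck; if the second does, the cost is in materializing the draws.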
# Trying parallel computation (12-core Ryzen 9 5900X)
library(future.apply)
plan(multisession(workers = 12))
options(future.globals.maxSize = 1024 * 10e10)

# With diagnostics, without restarting the R session after the initial run
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  test = NULL,
  metrics = "none"
))
#   user  system elapsed
#   3.70   39.58   85.23

# With diagnostics, after restarting the R session
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  test = NULL,
  metrics = "none"
))
#   user  system elapsed
#   3.16   38.22  147.08
I'm unsure whether there's a software solution. Given that I'm reading the models from a Samsung 970 EVO Plus M.2 drive, I'm not sure there's much improvement to be had on the I/O side.
I think I could profile describe_posterior() again and see whether there are other bottlenecks that can be fixed.
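A quick way to do that with base R's sampling profiler (a sketch; `soc_models_main` is the model list from above, and the output filename is arbitrary):

```r
# Profile describe_posterior() on a single model to find hot spots.
Rprof("describe_posterior.out")
bayestestR::describe_posterior(soc_models_main[[1]], ci = 0.89)
Rprof(NULL)

# Self-time per function, largest first.
summaryRprof("describe_posterior.out")$by.self
```

The $by.self table shows which functions the sampler caught directly, which is usually enough to tell whether the time goes to reading draws, diagnostics, or something else.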
As for where the bottleneck lies between diagnostic and test, the latter appears to be fairly efficient and doesn't add a substantial amount of time.
# With tests but without diagnostics, after restarting the R session
system.time(posterior_table <- modelsummary(
  models = soc_models_main,
  output = "modelsummary_list",
  statistic = "conf.int",
  conf_level = 0.89,
  diagnostic = NULL,
  metrics = "none"
))
#   user  system elapsed
#   3.17   36.47  124.53
If I had to guess, I'd say any gains to be had will come from either finding a more efficient way to read in the draws initially or a more efficient implementation for calculating rhat and ess. As it stands, performance is quite good if one doesn't need diagnostics or test statistics included in the modelsummary list objects.
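If the diagnostics themselves are the hot spot, one baseline comparison would be to compute them directly with the posterior package, bypassing the rest of the summary machinery. A sketch only; `mod` stands for any one of the brms models above:

```r
# Compute rhat and ESS directly on the draws array, as a baseline for
# how much of the total time the diagnostics alone account for.
draws <- posterior::as_draws_array(mod)
system.time(
  diag <- posterior::summarise_draws(draws, "rhat", "ess_bulk", "ess_tail")
)
```

Comparing this timing against the full modelsummary() run with and without diagnostics would show whether the diagnostic functions or the surrounding plumbing dominate.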
If the bulk of the issue is I/O, I think the simple solution is to set diagnostic, and perhaps test as well, to NULL by default, and only compute them if the user requests them or passes arguments that require them. This shouldn't be particularly difficult to implement on the modelsummary side and would pretty much resolve the issue. Thoughts on this solution, @vincentarelbundock?
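A minimal sketch of what that default could look like (illustrative only; get_bayes_estimates() is a hypothetical helper, not modelsummary's actual internal function):

```r
# Hypothetical helper: skip the expensive quantities unless the user
# explicitly asks for them by passing non-NULL arguments.
get_bayes_estimates <- function(model, diagnostic = NULL, test = NULL, ...) {
  bayestestR::describe_posterior(
    model,
    diagnostic = diagnostic,  # NULL by default: skips Rhat / ESS
    test = test,              # NULL by default: skips hypothesis tests
    ...
  )
}
```

The point is just that the NULL defaults make the fast path the default path, while users who want diagnostics can opt in with the same arguments as before.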
From my perspective, there has been sufficient progress here to close this more general issue. I opened a new more specific issue with the last specific recommendation: https://github.com/vincentarelbundock/modelsummary/issues/489