stan-dev / loo

loo R package for approximate leave-one-out cross-validation (LOO-CV) and Pareto smoothed importance sampling (PSIS)
https://mc-stan.org/loo

loo_moment_match(): Parallelization on Windows 10 #154

Closed fweber144 closed 4 years ago

fweber144 commented 4 years ago

At first, I thought this was related to brms (see paul-buerkner/brms#1000), but it seems to be a loo issue instead.

The problem is that loo_moment_match(), applied to a brmsfit object (though this may be irrelevant), throws an error on Windows 10 when run with more than one core. I have not tested other systems yet. A reproducible example on my machine:

library(brms)
library(loo)
options(mc.cores = 4) # On my machine, parallel::detectCores() returns 4.
data(roaches, package = "rstanarm")
roaches$roach1 <- roaches$roach1 / 100

roaches_fit <- brm(
  formula = y ~ roach1 + treatment + senior + offset(log(exposure2)),
  data = roaches,
  family = poisson(link = "log"),
  save_all_pars = TRUE,
  seed = 1692275530
)

roaches_loo <- loo(roaches_fit)
# There are 11 observations with a pareto_k > 0.7:
print(roaches_loo)
# So moment matching makes sense.

# With the default for argument "cores", an error is thrown:
roaches_loo <- loo_moment_match(roaches_fit,
                                loo = roaches_loo)
### ERROR:
# Error in checkForRemoteErrors(val) : 
#   4 nodes produced errors; first error: the model object is not created or not valid
# Error: Moment matching failed. Perhaps you did not set 'save_all_pars' to TRUE when fitting your model?
### 
# With "cores = 1", it works:
roaches_loo <- loo_moment_match(roaches_fit,
                                loo = roaches_loo,
                                cores = 1)

# The same behavior occurs for brms::add_criterion():
# Does not work:
roaches_fit <- add_criterion(roaches_fit,
                             criterion = "loo",
                             moment_match = TRUE)
### ERROR:
# Error in checkForRemoteErrors(val) : 
#   4 nodes produced errors; first error: the model object is not created or not valid
# Error: Moment matching failed. Perhaps you did not set 'save_all_pars' to TRUE when fitting your model?
### 
# Works:
roaches_fit <- add_criterion(roaches_fit,
                             criterion = "loo",
                             moment_match = TRUE,
                             moment_match_args = list(cores = 1))

This issue may be related to #129, in which case something like

parallel::clusterExport(cl, chr_vector_of_objects_to_export, envir = environment())

might be missing in loo::loo_moment_match.default().

It may also be related to #94, which would suggest a more general problem with parallelization on Windows 10.
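To illustrate the clusterExport() idea above, here is a minimal sketch using only the parallel package. It is not the loo code: scale_factor and scale_it are hypothetical names made up for this illustration. The assumption is that on Windows the workers are separate R sessions (a PSOCK cluster) that start with an empty global environment, so objects from the calling session are not visible there unless they are exported.

```r
library(parallel)

# Hypothetical objects for illustration only (not part of loo or brms).
scale_factor <- 2
scale_it <- function(i) i * scale_factor  # relies on scale_factor being visible

# PSOCK workers are fresh R sessions; this is the cluster type used on Windows.
cl <- makePSOCKcluster(2)

# Without exporting, the workers may not find scale_factor. When this script is
# run at top level, the call is expected to fail with a "nodes produced errors"
# message; the condition message is captured instead of stopping the script.
without_export <- tryCatch(
  unlist(parLapply(cl, 1:3, scale_it)),
  error = function(e) conditionMessage(e)
)

# After exporting the object to every worker, the same call succeeds.
clusterExport(cl, "scale_factor", envir = environment())
with_export <- unlist(parLapply(cl, 1:3, scale_it))

stopCluster(cl)

print(without_export)
print(with_export)
```

Whether a missing export is really the cause here is a guess. The error above ("the model object is not created or not valid") could also mean that the object does reach the workers but is not usable there.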

My session info:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[...]   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] loo_2.3.1   brms_2.13.5 Rcpp_1.0.5 

loaded via a namespace (and not attached):
 [1] Brobdingnag_1.2-6    jsonlite_1.7.1       gtools_3.8.2         StanHeaders_2.21.0-6 RcppParallel_5.0.2  
 [6] threejs_0.3.3        shiny_1.5.0          assertthat_0.2.1     stats4_4.0.2         backports_1.1.9     
[11] pillar_1.4.6         lattice_0.20-41      glue_1.4.2           digest_0.6.25        checkmate_2.0.0     
[16] promises_1.1.1       colorspace_1.4-1     htmltools_0.5.0      httpuv_1.5.4         Matrix_1.2-18       
[21] plyr_1.8.6           dygraphs_1.1.1.6     pkgconfig_2.0.3      rstan_2.21.2         purrr_0.3.4         
[26] xtable_1.8-4         mvtnorm_1.1-1        scales_1.1.1         processx_3.4.4       later_1.1.0.1       
[31] tibble_3.0.3         bayesplot_1.7.2      generics_0.0.2       ggplot2_3.3.2        ellipsis_0.3.1      
[36] DT_0.15              withr_2.2.0          shinyjs_2.0.0        cli_2.0.2            magrittr_1.5        
[41] crayon_1.3.4         mime_0.9             ps_1.3.4             fansi_0.4.1          nlme_3.1-148        
[46] xts_0.12.1           pkgbuild_1.1.0       colourpicker_1.1.0   rsconnect_0.8.16     tools_4.0.2         
[51] prettyunits_1.1.1    lifecycle_0.2.0      matrixStats_0.56.0   stringr_1.4.0        V8_3.2.0            
[56] munsell_0.5.0        callr_3.4.4          compiler_4.0.2       rlang_0.4.7          grid_4.0.2          
[61] ggridges_0.5.2       rstudioapi_0.11      htmlwidgets_1.5.1    crosstalk_1.1.0.1    igraph_1.2.5        
[66] miniUI_0.1.1.1       base64enc_0.1-3      codetools_0.2-16     gtable_0.3.0         inline_0.3.16       
[71] abind_1.4-5          curl_4.3             markdown_1.1         reshape2_1.4.4       R6_2.4.1            
[76] gridExtra_2.3        rstantools_2.1.1     zoo_1.8-8            bridgesampling_1.0-0 dplyr_1.0.2         
[81] fastmap_1.0.1        shinystan_2.5.0      shinythemes_1.1.2    stringi_1.5.3        parallel_4.0.2      
[86] vctrs_0.3.4          tidyselect_1.1.0     coda_0.19-3

topipa commented 4 years ago

Thank you for reporting this! This is most likely related to https://github.com/stan-dev/loo/issues/129, as you said. The PR https://github.com/stan-dev/loo/pull/152 fixed that issue, but there is a similar parallelisation structure in loo_moment_match_i_fun, which was not updated in that PR.

I will look into it.

topipa commented 4 years ago

I think this is actually brms-related, as you first guessed. I was not able to reproduce this without brms, and I found a solution for brms that may fix it. More details are in https://github.com/paul-buerkner/brms/issues/1000. Closing this for now at least.