pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/
187 stars 27 forks source link

passing family as variable to sdmTMB_cv #127

Closed seanhardison1 closed 1 year ago

seanhardison1 commented 2 years ago

Hi,

I was wondering if anyone had any insights into how I might pass a family variable to sdmTMB_cv. It seems to work with sdmTMB() but fails for the CV version. For example,

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)

# Set parallel processing first if desired with the future package.
# See the Details section above.

fam <- tweedie(link = "log")

# works
m <- sdmTMB(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam
)

# works
m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"), k_folds = 2
)

# doesn't work
m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam, k_folds = 2
)

Thanks!

seananderson commented 2 years ago

I think that's related to this older issue: https://github.com/pbs-assess/sdmTMB/issues/54

seananderson commented 2 years ago

Strangely, I can't reproduce the issue. (I did just push a fix for the annoying printing of the sdmTMB_cv() models where the data and mesh would be printed too.)

library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)

fam <- tweedie(link = "log")

m <- sdmTMB(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam
)

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam, k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

m_cv$models[[1]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>  
#>               coef.est coef.se
#> depth_scaled     -1.76    0.21
#> depth_scaled2    -1.27    0.10
#> 
#> Dispersion parameter: 13.68
#> Tweedie p: 1.60
#> Matern range: 108.62
#> Spatial SD: 2.73
#> ML criterion at convergence: 3177.450
#> 
#> See ?tidy.sdmTMB to extract these values as a data frame.
m_cv$models[[2]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>  
#>               coef.est coef.se
#> depth_scaled     -1.48    0.16
#> depth_scaled2    -1.29    0.11
#> 
#> Dispersion parameter: 13.99
#> Tweedie p: 1.61
#> Matern range: 187.00
#> Spatial SD: 2.24
#> ML criterion at convergence: 3329.646
#> 
#> See ?tidy.sdmTMB to extract these values as a data frame.

Created on 2022-08-30 with reprex v2.0.2

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.3.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.14   knitr_1.39        magrittr_2.0.3    R.cache_0.16.0   
#>  [5] rlang_1.0.4       fastmap_1.1.0     fansi_1.0.3       stringr_1.4.1    
#>  [9] styler_1.7.0      highr_0.9         tools_4.2.1       xfun_0.32        
#> [13] R.oo_1.25.0       utf8_1.2.2        cli_3.3.0         withr_2.5.0      
#> [17] htmltools_0.5.3   yaml_2.3.5        digest_0.6.29     tibble_3.1.8     
#> [21] lifecycle_1.0.1   purrr_0.3.4       R.utils_2.12.0    vctrs_0.4.1      
#> [25] fs_1.5.2          glue_1.6.2        evaluate_0.16     rmarkdown_2.15   
#> [29] reprex_2.0.2      stringi_1.7.8     compiler_4.2.1    pillar_1.8.1     
#> [33] R.methodsS3_1.8.2 pkgconfig_2.0.3

Created on 2022-08-30 with reprex v2.0.2

seanhardison1 commented 2 years ago

Thanks Sean. Okay I think I figured out what's triggering the error. I was able to run the code I posted previously in a fresh R session, but it breaks down if I re-run sdmTMB_cv after calling plan(multisession):

fam <- tweedie(link = "log")
plan(multisession)
m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam, k_folds = 2
)

Can you confirm? I don't know enough about how the futures library works to know what's causing the error.

seanhardison1 commented 2 years ago

Not sure if this is helpful, but the issue could be related to fam being a global variable that needs to be passed explicitly to the future call. This (unfortunately verbose) code will run, but seems to only evaluate sequentially despite plan(multisession) being called

library(future)
library(sdmTMB)
plan(multisession)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
fam <- tweedie()

globals <- list(fam = fam,
                pcod = pcod, 
                mesh = mesh,
                sdmTMB_cv = sdmTMB_cv)

f <- 
  future({
    sdmTMB_cv(
      density ~ 0 + depth_scaled + depth_scaled2,
      data = pcod, mesh = mesh,
      family = fam, k_folds = 8
    )
}, globals = globals)

m_cv <- value(f)
m_cv

New sessions are opened but the code only evaluates in a single other session:

image

Compared to the simpler

plan(multisession)
m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"), k_folds = 5
)

which will execute in k sessions

image
sessioninfo: ``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] future_1.27.0 readxl_1.4.1 mgcv_1.8-40 nlme_3.1-157 pals_1.7 [6] ggtext_0.1.1 gganimate_1.0.7 lubridate_1.8.0 tictoc_1.0.1 patchwork_1.1.2 [11] raster_3.5-29 fasterize_1.0.3 magrittr_2.0.3 INLA_22.05.07 sp_1.5-0 [16] foreach_1.5.2 Matrix_1.4-1 sf_1.0-8 sdmTMB_0.1.0 forcats_0.5.2 [21] stringr_1.4.1 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 [26] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2 loaded via a namespace (and not attached): [1] googledrive_2.0.0 minqa_1.2.4 colorspace_2.0-3 ellipsis_0.3.2 [5] class_7.3-20 rgdal_1.5-32 rprojroot_2.0.3 estimability_1.4.1 [9] fs_1.5.2 gridtext_0.1.4 dichromat_2.0-0.1 rstudioapi_0.14 [13] proxy_0.4-27 glmmTMB_1.1.4 listenv_0.8.0 farver_2.1.1 [17] fansi_1.0.3 mvtnorm_1.1-3 xml2_1.3.3 codetools_0.2-18 [21] splines_4.2.1 dream_0.0.0.9000 jsonlite_1.8.0 nloptr_2.0.3 [25] broom_1.0.1 anytime_0.3.9 dbplyr_2.2.1 fishMod_0.29 [29] mapproj_1.2.8 compiler_4.2.1 httr_1.4.4 emmeans_1.8.0 [33] backports_1.4.1 assertthat_0.2.1 gargle_1.2.0 cli_3.3.0 [37] tweenr_2.0.1 prettyunits_1.1.1 tools_4.2.1 coda_0.19-4 [41] gtable_0.3.0 glue_1.6.2 maps_3.4.0 Rcpp_1.0.9 [45] cellranger_1.1.0 vctrs_0.4.1 iterators_1.0.14 globals_0.16.1 [49] lme4_1.1-30 rvest_1.0.3 lifecycle_1.0.1 googlesheets4_1.0.1 [53] terra_1.6-7 tsibble_1.1.2 MASS_7.3-57 scales_1.2.1 [57] hms_1.1.2 TMB_1.9.1 mvnfast_0.2.7 DHARMa_0.4.5 [61] stringi_1.7.8 e1071_1.7-11 gratia_0.7.3 boot_1.3-28 [65] rlang_1.0.4 pkgconfig_2.0.3 lattice_0.20-45 tidyselect_1.1.2 [69] parallelly_1.32.1 here_1.0.1 R6_2.5.1 generics_0.1.3 [73] DBI_1.1.3 pillar_1.8.1 haven_2.5.1 withr_2.5.0 [77] units_0.8-0 future.apply_1.9.0 modelr_0.1.9 crayon_1.5.1 [81] KernSmooth_2.23-20 utf8_1.2.2 tzdb_0.3.0 progress_1.2.2 [85] grid_4.2.1 digest_0.6.29 reprex_2.0.2 classInt_0.4-7 [89] xtable_1.8-4 numDeriv_2016.8-1.1 munsell_0.5.0 ```
ericward-noaa commented 2 years ago

I'm just seeing this thread, and confirmed the example still isn't working. Looks like the family argument is not being parsed correctly -- and crashes here https://github.com/pbs-assess/sdmTMB/blob/aad58e0a11f93abd987aca68a52d820b87895f56/R/fit.R#L554

Not perfect, but one workaround that I confirmed works is to use use_initial_fit = TRUE. This just fits the model to the first fold, and uses those estimates as starting values for other folds. The family argument also gets passed correctly. I'll look into why the other approach isn't working

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = fam, k_folds = 2,
  use_initial_fit = TRUE
)
seananderson commented 1 year ago

For some reason this seems to be working for me now, but I've seen other issues with future not detecting globals. I've now added an argument future_globals in sdmTMB_cv() that passes a character vector of global objects to the future package if needed: https://github.com/pbs-assess/sdmTMB/blob/95d9f88847dd598ee3c3524ca3ba7f0dba2f0cab/R/cross-val.R#L101-L105

https://github.com/pbs-assess/sdmTMB/blob/95d9f88847dd598ee3c3524ca3ba7f0dba2f0cab/R/cross-val.R#L421-L429