pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/
183 stars 26 forks source link

cross-validation vignette needs updating #259

Open Lewis-Barnett-NOAA opened 11 months ago

Lewis-Barnett-NOAA commented 11 months ago

There is no elpd object created in most recent version, but this is a metric that the vignette focuses on and expects to be in the model object

Lewis-Barnett-NOAA commented 11 months ago

also, this model won't converge, with a non-positive-definite Hessian:

test_years <- c(2011, 2013, 2015, 2017)

models <- list() log_lik <- list() for (i in 1:length(test_years)) { clust <- rep(1, nrow(pcod)) clust[which(pcod$year < test_years[i])] <- 2

models[[i]] <- sdmTMB_cv( density ~ 0 + s(depth_scaled), data = pcod, mesh = mesh, spatiotemporal = "off", fold_ids = clust, family = tweedie(link = "log"), k_folds = length(unique(clust)) )

log_lik[[i]] <- sum(m_cv$data$cv_loglik[which(m_cv$data$year == test_years[i])]) }

seananderson commented 11 months ago

I made some quick fixes and also switched eval=TRUE so these things will be detected in the future. It's probably worth restructuring so that model comparison comes first since that is the main point of cross validation. I removed the old LFOCV section. I started adding in a new example, but then realized that I am totally confused by how these arguments work:

library(sdmTMB)
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
m_lfocv <- sdmTMB_cv(
  present ~ s(year, k = 3),
  data = pcod,
  mesh = mesh,
  lfo = TRUE, # do LFOCV
  lfo_forecast = 2, # number of time steps to forecast
  lfo_validations = 3, # number of times to validate
  family = binomial(),
  spatiotemporal = "off",
  time = "year" # must be specified
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# See how the LFOCV folds were assigned:
example_data <- m_lfocv$models[[2]]$data
table(example_data$cv_fold, example_data$year)
#>    
#>     2003 2004 2005 2007 2009 2011 2013 2015 2017
#>   1  232  230  224  255  233    0    0    0    0
#>   2    0    0    0    0    0  251    0    0    0
#>   3    0    0    0    0    0    0  240    0    0
#>   4    0    0    0    0    0    0    0  238    0
#>   5    0    0    0    0    0    0    0    0  240

Created on 2023-09-29 with reprex v2.0.2

Here is what I would have expected:

An example of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3: Fit data to time steps 1 to 5, predict and validate step 7. Fit data to time steps 1 to 6, predict and validate step 8. Fit data to time steps 1 to 7, predict and validate step 9.

One thing I'm not sure is if the time steps are 2007, 2008, 2009 etc. or 2007, 2009, 2011 (the steps provided). But I'm also just not sure how lfo_forecast and lfo_validations correspond to the table() above.

ericward-noaa commented 11 months ago

Sorry I was slow here, see https://github.com/pbs-assess/sdmTMB/pull/262

For your example, of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3, the model is doing: Fit data to time steps 1 to 5, predict and validate step 7. Fit data to time steps 1 to 6, predict and validate step 8. Fit data to time steps 1 to 7, predict and validate step 9.

This can be seen in a couple places in the code, e.g. https://github.com/pbs-assess/sdmTMB/blob/e960d1e68069e7a0e370d0e612c78097d9fb53e5/R/cross-val.R#L335

The time steps used are those provided (2007, 2009, 2011).

Here's an example of how the table is made, using a snippet from sdmTMB_cv:

data <- pcod
time <- "year"
data$cv_fold <- 1
t_validate <- sort(unique(data[[time]]), decreasing = TRUE)
lfo_validations <- 5
lfo_forecast <- 1
for (t in seq(1, lfo_validations)) {
  # fold id increasing order + forecast
  data$cv_fold[data[[time]] == t_validate[t]] <- lfo_validations - t + 1 + lfo_forecast
}
data$cv_fold <- as.numeric(as.factor(data$cv_fold))

table(data$cv_fold, data$year)