pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/
182 stars 26 forks source link

Add RMSE and MAE sdmTMB_cv examples to cross validation vignette #322

Open seanhardison1 opened 4 months ago

seanhardison1 commented 4 months ago

Hello! I'm a big fan of the sdmTMB_cv function, but I find the likelihood-based measures that are currently returned with the function to be a bit unintuitive. For a future iteration, it would be nice to have the function also return RMSE and MAE (by fold? averaged across folds?) in addition to the likelihood-based measures, or perhaps allow for user defined functions.

seananderson commented 4 months ago

We can certainly get some common ones in there and/or allow a user-specified function. In the meantime, it's relatively easy to calculate whatever you need yourself with the cv_predicted column. The log likelihood of the left-out data has the advantage of being consistent with the what the model is trying to maximize in the data portion. It is similar to the log score (not sure what the classic paper is, but Draper often wrote about it) or the ELPD in Bayesian stats. But, yes, I can see the benefit of calculating other classic measures on the scale of the data.

library(sdmTMB)
library(dplyr)

m <- sdmTMB_cv(
  density ~ depth_scaled + depth_scaled2,
  data = pcod, mesh = make_mesh(pcod, c("X", "Y"), cutoff = 25),
  family = tweedie(link = "log"), k_folds = 3
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

names(m$data)
#>  [1] "year"          "X"             "Y"             "depth"        
#>  [5] "density"       "present"       "lat"           "lon"          
#>  [9] "depth_mean"    "depth_sd"      "depth_scaled"  "depth_scaled2"
#> [13] "cv_fold"       "cv_predicted"  "cv_loglik"

# see cv_predicted for the prediction within each left-out fold

d <- m$data

# RMSE:
sqrt(mean((d$density - d$cv_predicted)^2)) 
#> [1] 194.1635

# MAE:
mean(abs(d$density - d$cv_predicted))
#> [1] 49.19603

# by fold:
group_by(d, cv_fold) |> 
  summarize(
    rmse = sqrt(mean((density - cv_predicted)^2)),
    mae = mean(abs(density - cv_predicted))
  )
#> # A tibble: 3 × 3
#>   cv_fold  rmse   mae
#>     <int> <dbl> <dbl>
#> 1       1 176.   47.4
#> 2       2 270.   55.5
#> 3       3  87.8  44.5

Created on 2024-03-18 with reprex v2.1.0

seanhardison1 commented 4 months ago

Great, thank you Sean! Seeing how easy it is to do this with the current function, I think just adding this example to the CV vignette would be a fine alternative.

seananderson commented 4 months ago

Good idea. I've re-opened the issue and renamed it.