Make get_index() smarter about extra time slices and area argument

seananderson commented 4 months ago

Example showing the issue below. get_index() should be smart about letting the area vector match up with the original newdata. Instead it currently expects extra 'fake' area values for the extra time slices. @Lewis-Barnett-NOAA

library(sdmTMB)

fit <- sdmTMB(
  density ~ s(depth),
  time_varying_type = 'ar1',
  time_varying = ~ 1,
  time = 'year',
  spatial = 'off',
  spatiotemporal = 'off',
  extra_time = c(2012, 2014, 2016),
  data = pcod_2011,
  family = tweedie(link = "log")
)
fit
#> Model fit by ML ['sdmTMB']
#> Formula: density ~ s(depth)
#> Mesh: NULL (isotropic covariance)
#> Time column: year
#> Data: pcod_2011
#> Family: tweedie(link = 'log')
#>  
#>             coef.est coef.se
#> (Intercept)     2.52    0.29
#> sdepth          2.85    3.33
#> 
#> Smooth terms:
#>            Std. Dev.
#> sds(depth)     13.75
#> 
#> Time-varying parameters:
#>                  coef.est coef.se
#> (Intercept)-2011     0.10    0.18
#> (Intercept)-2012     0.00    0.33
#> (Intercept)-2013     0.06    0.18
#> (Intercept)-2014     0.00    0.41
#> (Intercept)-2015     0.27    0.19
#> (Intercept)-2016     0.00    0.33
#> (Intercept)-2017    -0.42    0.20
#> 
#> Dispersion parameter: 15.02
#> Tweedie p: 1.60
#> ML criterion at convergence: 2975.614
#> 
#> See ?tidy.sdmTMB to extract these values as a data frame.

# newdata doesn't have all fitted years:
nd <- replicate_df(qcs_grid, "year", unique(pcod_2011$year))
nd$area <- 4
p <- predict(fit, newdata = nd, return_tmb_object = TRUE)
# going to error out:
ind <- get_index(p, area = nd$area)
#> Error in `get_generic()` at sdmTMB/R/index.R:79:3:
#> ! `area` should be of the same length as `nrow(newdata)` or of length 1.

# it's because of the extra_time stuff
# predicting on all years fixes it:
nd <- replicate_df(qcs_grid, "year", seq(2011, 2017))
nd$area <- 4
p <- predict(fit, newdata = nd, return_tmb_object = TRUE)
ind0 <- get_index(p, area = nd$area)
#> Bias correction is turned off.
#> It is recommended to turn this on for final inference.

# we should rethink this... but this is how you could solve the problem right now:
# newdata doesn't have all fitted years:
nd <- replicate_df(qcs_grid, "year", unique(pcod_2011$year))
nd$area <- 4
p <- predict(fit, newdata = nd, return_tmb_object = TRUE)

# this is your actual data:
head(p$data)
#>      depth depth_scaled depth_scaled2 year area       est
#> 1 347.0834    1.5608122    2.43613479 2011    4 -4.119667
#> 2 223.3348    0.5697699    0.32463771 2011    4  3.239364
#> 3 203.7408    0.3633693    0.13203724 2011    4  3.619478
#> 4 183.2987    0.1257046    0.01580166 2011    4  4.082253
#> 5 182.9998    0.1220368    0.01489297 2011    4  4.088946
#> 6 186.3892    0.1632882    0.02666303 2011    4  4.012064

# this is your 'fake' internal data for the extra time slices:
p$fake_nd
#>     X    Y    depth depth_scaled depth_scaled2 year area _sdmTMB_fake_nd_
#> 1 456 5636 347.0834     1.560812      2.436135 2012    4             TRUE
#> 2 456 5636 347.0834     1.560812      2.436135 2014    4             TRUE
#> 3 456 5636 347.0834     1.560812      2.436135 2016    4             TRUE
# these won't get used anywhere but for internal accounting of time

area <- c(p$data$area, p$fake_nd$area)

# works:
ind <- get_index(p, area = area)
#> Bias correction is turned off.
#> It is recommended to turn this on for final inference.

library(ggplot2)
ggplot(ind, aes(year, est, ymin = lwr, ymax = upr)) + geom_pointrange()


# show it matches:
ggplot(ind, aes(year, est, ymin = lwr, ymax = upr)) + geom_pointrange() +
  geom_pointrange(data = ind0, colour = "red", mapping = aes(x = year + 0.05))


ggplot(ind, aes(year, est, ymin = lwr, ymax = upr)) + geom_pointrange() +
  geom_pointrange(data = subset(ind0, year %in% seq(2011, 2017, 2)),
    colour = "red", mapping = aes(x = year + 0.05))


ind$est - ind0$est[ind0$year %in% seq(2011, 2017, 2)]
#> [1] 0 0 0 0

^{Created on 2024-03-18 with reprex v2.1.0}

seananderson commented 4 months ago

While fixing this, it might be worth finally making the TMB code smarter about only bias correcting the index values the user cares about instead of also bias correcting the missing time slices (even though they're fast because they only have 1 fake data point per time slice).

seananderson commented 3 months ago

@Lewis-Barnett-NOAA, just noting that this has been fixed and predict.sdmTMB and index integration is now smart about offset and area lengths. You will want to remove any code that uses the hack above around padding the area vector. I've also fleshed out the unit tests around this. I just did a bunch of benchmarking around not calculating the index for any years not in the prediction data frame (currently 1 row of each is included and then omitted at the end) and the time and memory use is almost identical but the code gets more complex... so I may just leave it as is.

pbs-assess / sdmTMB

Make get_index() smarter about extra time slices and area argument #323