pbs-assess / sdmTMB

:earth_americas: An R package for spatial and spatiotemporal GLMMs with TMB
https://pbs-assess.github.io/sdmTMB/

Prediction fails when both `newdata` and `offset` are provided and model includes `extra_time`. #270

Closed sebpardo closed 10 months ago

sebpardo commented 10 months ago

The function `predict.sdmTMB()` throws an error when both `newdata` and `offset` are provided for a model fitted with `extra_time` values.

library(sdmTMB)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

pcod$os <- rep(log(0.01), nrow(pcod)) # offset

m <- sdmTMB(
  data = pcod,
  formula = density ~ poly(log(depth), 2),
  mesh = mesh,
  offset = pcod$os,
  family = tweedie(link = "log"),
  spatial = "on",
  time = "year",
  extra_time = c(2006, 2008, 2010, 2012, 2014, 2016), 
  spatiotemporal = "AR1"
)

p1 <- predict(m, offset = pcod$os) # This works
p2 <- predict(m, newdata = pcod, offset = pcod$os) # This fails
# Error in `predict()`:
# ! Prediction offset vector does not equal number of rows in prediction dataset.
# Run `rlang::last_trace()` to see where the error occurred.
p3 <- predict(m, newdata = pcod) # This works

Interestingly, you can make it work by "padding" the offset with one extra value for each `extra_time` value:

p4 <- predict(m, newdata = pcod, offset = c(pcod$os, rep(log(0.01), 6))) # This works
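The hardcoded 6 above matches the number of `extra_time` values. A more general form of the same workaround (a sketch; it assumes one fake row is appended per `extra_time` slice and that `log(0.01)` matches the offset used at fit time):

```r
# Generalized padding workaround (sketch, not sdmTMB internals):
# pad the offset with one value per extra_time slice instead of a
# hardcoded 6.
extra_time <- c(2006, 2008, 2010, 2012, 2014, 2016)
os <- rep(log(0.01), 2143)  # stand-in for pcod$os (pcod has 2143 rows)
os_padded <- c(os, rep(log(0.01), length(extra_time)))
length(os_padded)  # 2149 = nrow(newdata) + length(extra_time)
# p4 <- predict(m, newdata = pcod, offset = os_padded)
```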

Prediction also works when no `extra_time` is provided:

m2 <- sdmTMB(
  data = pcod,
  formula = density ~ poly(log(depth), 2),
  mesh = mesh,
  offset = pcod$os,
  family = tweedie(link = "log"),
  spatial = "on",
  time = "year",
  spatiotemporal = "AR1"
)

p5 <- predict(m2, newdata = pcod, offset = pcod$os) # This works

I found it informative that the prediction for p1 has six more rows than all the other predictions (one for each `extra_time` value), which is especially interesting given that p4 has a longer offset:

dim(p1)
# [1] 2149   19
dim(p3)
# [1] 2143   18
dim(p4)
# [1] 2143   18
dim(p5)
# [1] 2143   18
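These row counts are consistent with `newdata` being padded internally with one fake row per `extra_time` value before the offset length is checked. A minimal base-R sketch of that presumed sequence (hypothetical reconstruction with stand-in data, not the actual sdmTMB source):

```r
# Hypothetical reconstruction of the failing check (not sdmTMB code).
extra_time <- c(2006, 2008, 2010, 2012, 2014, 2016)
newdata <- data.frame(row = seq_len(2143))   # stand-in for pcod
offset  <- rep(log(0.01), nrow(newdata))     # user-supplied offset

# predict() appears to pad newdata with one fake row per extra_time slice:
fake_nd <- data.frame(row = rep(NA_integer_, length(extra_time)))
nd_padded <- rbind(newdata, fake_nd)

# ...and then compares the *padded* row count against the original offset:
length(offset) == nrow(nd_padded)  # FALSE: 2143 != 2149, hence the error
```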

I think the problematic padding of `newdata` occurs here and here.

I'm not sure why there is a need to add "fake newdata" (`_sdmTMB_fake_nd_ = TRUE`) to `newdata` for the `extra_time` values when no rows with those timesteps are present in `newdata`. I first thought the easiest solution would be to pad the offset vector internally at the same time as `newdata`, but then I wondered why you'd want those extra six rows in your prediction output. I'm not sure yet what the ideal solution should look like.

seananderson commented 10 months ago

Thanks! Fixed. Keeping all time slices in the data at all times makes the bookkeeping simpler within the .cpp.