inlabru
https://inlabru-org.github.io/inlabru/

predicting to out of group samples in inlabru #117

Open njtierney opened 3 years ago

njtierney commented 3 years ago

Hi there,

We (@goldingn and myself) were wondering if and how inlabru handles out-of-sample prediction — that is, what happens if new groups are added to the data and we try to predict for them.

The prediction runs, but the predicted values for the new groups appear to be identical to those of the last group in the series. We were wondering whether there is an issue with how we've written the prediction, or whether inla/inlabru doesn't support this kind of out-of-sample prediction.

library(inlabru)
#> Loading required package: ggplot2
#> Loading required package: sp
options(tidyverse.quiet = TRUE)
library(tidyverse)

# simulate data

set.seed(2021-06-11)
n <- 100
n_group <- 15
n_time <- 10
df <- tibble(
  x = sample.int(n_time, n, replace = TRUE),
  z = sample.int(n_group, n, replace = TRUE)
)

true_int <- rnorm(1)
true_coef <- rnorm(n_group, 0, 10)
df$y <- rnorm(n, true_int + true_coef[df$z])

ggplot(df,
       aes(x = x,
           y = y,
           colour = factor(z))) +
  geom_line() +
  theme(legend.position = "none")

# fit a model with a time series (on x) correlated between groups (z)
m <- bru(
  y ~ time(x,
         model = "ar1",
         group = z,
         constr = FALSE),
  data = df
)

# predict from model to the final time point, but for two new groups
df_pred <- data.frame(
  x = n_time,
  z = seq_len(n_group + 2)
)

fit <- predict(
  object = m,
  data = df_pred,
  formula = ~ c(pred = Intercept + time_eval(x, group = z)),
  n.samples = 1e3
)

fit
#>         x  z         mean        sd     q0.025      median       q0.975
#> pred1  10  1   5.95374331 0.6707510   4.704009   5.9504185   7.31650644
#> pred2  10  2  -0.09927763 0.6751873  -1.501106  -0.1037973   1.30383497
#> pred3  10  3   8.24582869 0.7015647   6.872147   8.2465216   9.78078744
#> pred4  10  4   4.65542047 0.5949204   3.458076   4.6373079   5.86436078
#> pred5  10  5   5.12999116 0.7107811   3.775194   5.1339094   6.79557376
#> pred6  10  6  -2.30246353 0.7815665  -3.934392  -2.2705040  -0.81721817
#> pred7  10  7 -18.44009347 0.7411492 -19.964386 -18.4337909 -16.92573363
#> pred8  10  8  -5.24929762 0.6539726  -6.604049  -5.2146644  -4.05963941
#> pred9  10  9  -1.84045714 0.8227068  -3.557010  -1.8280683  -0.25265695
#> pred10 10 10 -10.24198546 0.5572366 -11.293369 -10.2246455  -9.13210165
#> pred11 10 11   6.89537741 0.8344171   5.265169   6.8699004   8.59720718
#> pred12 10 12  -0.87476387 0.4568525  -1.800008  -0.8654010   0.04481905
#> pred13 10 13  -9.54389540 0.5433422 -10.627809  -9.5535491  -8.46563992
#> pred14 10 14   8.17887443 0.4648997   7.254812   8.1641717   9.11569642
#> pred15 10 15   7.31962759 0.5785128   6.149535   7.3193975   8.43502126
#> pred16 10 16   7.31962759 0.5785128   6.149535   7.3193975   8.43502126
#> pred17 10 17   7.31962759 0.5785128   6.149535   7.3193975   8.43502126
#>              smin        smax          cv       var
#> pred1    3.246321   9.0598903  0.11266039 0.4499070
#> pred2   -2.284192   2.1222578 -6.80100201 0.4558779
#> pred3    5.188185  10.7981484  0.08508117 0.4921931
#> pred4    2.365797   6.9292645  0.12779090 0.3539302
#> pred5    2.726562   7.6063475  0.13855406 0.5052098
#> pred6   -5.463239   0.3067262 -0.33944793 0.6108462
#> pred7  -21.421996 -15.4893786 -0.04019227 0.5493022
#> pred8   -7.762566  -3.3007031 -0.12458288 0.4276802
#> pred9   -4.852435   1.1661020 -0.44701221 0.6768465
#> pred10 -11.919811  -8.5754744 -0.05440709 0.3105126
#> pred11   3.363856  10.3879913  0.12101109 0.6962520
#> pred12  -2.367696   0.4215385 -0.52225807 0.2087142
#> pred13 -11.504200  -7.7109795 -0.05693087 0.2952208
#> pred14   6.812064   9.9946395  0.05684152 0.2161317
#> pred15   5.054687   9.1597521  0.07903582 0.3346770
#> pred16   5.054687   9.1597521  0.07903582 0.3346770
#> pred17   5.054687   9.1597521  0.07903582 0.3346770

The predictions for the two new groups appear to be identical to those for the last observed group?

Created on 2021-06-14 by the reprex package (v2.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.0 (2021-05-18) #> os macOS Big Sur 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Perth #> date 2021-06-14 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0) #> broom 0.7.6 2021-04-05 [1] CRAN (R 4.1.0) #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0) #> cli 2.5.0 2021-04-26 [1] CRAN (R 4.1.0) #> colorspace 2.0-1 2021-05-04 [1] CRAN (R 4.1.0) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0) #> curl 4.3.1 2021-04-30 [1] CRAN (R 4.1.0) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0) #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0) #> dplyr * 1.0.6 2021-05-05 [1] CRAN (R 4.1.0) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0) #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) #> farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0) #> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0) #> ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.1.0) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0) #> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0) #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0) #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0) #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0) #> INLA 21.05.02 2021-05-03 [1] local #> inlabru * 2.3.1 2021-03-22 [1] CRAN (R 4.1.0) #> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0) #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0) #> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0) #> lattice 0.20-44 2021-05-02 [1] CRAN (R 
4.1.0) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0) #> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) #> Matrix 1.3-4 2021-06-01 [1] CRAN (R 4.1.0) #> MatrixModels 0.5-0 2021-03-02 [1] CRAN (R 4.1.0) #> mime 0.10 2021-02-13 [1] CRAN (R 4.1.0) #> mnormt 2.0.2 2020-09-01 [1] CRAN (R 4.1.0) #> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0) #> numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.1.0) #> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.1.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) #> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.1.0) #> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.1.0) #> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0) #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0) #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0) #> rmarkdown 2.8 2021-05-07 [1] CRAN (R 4.1.0) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) #> rvest 1.0.0 2021-03-09 [1] CRAN (R 4.1.0) #> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0) #> sn 2.0.0 2021-03-28 [1] CRAN (R 4.1.0) #> sp * 1.4-5 2021-01-10 [1] CRAN (R 4.1.0) #> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.1.0) #> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0) #> styler 1.4.1 2021-03-30 [1] CRAN (R 4.1.0) #> tibble * 3.1.2 2021-05-16 [1] CRAN (R 4.1.0) #> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0) #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) #> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0) #> tmvnsim 1.0-2 2016-12-15 [1] CRAN (R 4.1.0) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.1.0) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0) #> xfun 0.23 2021-05-15 [1] CRAN (R 4.1.0) #> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library ```
goldingn commented 3 years ago

Just to add: we tried an IID (random intercept) version of this model with new groups added, and that worked as expected, sampling the new group values from the effect's distribution. I see that IID out-of-sample prediction is documented in predict.bru(). Mostly hoping there's a trick to doing this with more complicated models like this group AR1, so that we won't have to extract components and code it up ourselves :)
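For reference, a minimal sketch of the iid version described above (untested here; the component name `grp` is our own choice, and evaluating the component by name in the predict formula assumes inlabru's default component-evaluation behaviour):

```r
library(inlabru)

# Sketch: iid (random intercept) model; predict.bru() documents that unseen
# "iid" indices are sampled from the group effect's distribution.
m_iid <- bru(y ~ grp(z, model = "iid"), data = df)

# predict for two groups beyond those seen during fitting
df_new <- data.frame(z = seq_len(n_group + 2))
fit_iid <- predict(
  object = m_iid,
  data = df_new,
  formula = ~ c(pred = Intercept + grp),
  n.samples = 1e3
)
```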

finnlindgren commented 3 years ago

Out-of-sample prediction on the observation level is much easier than out-of-sample prediction for the latent field model.

INLA doesn't really implement observation level prediction at all, but in inlabru we can include it in the Monte Carlo part of the predict method without problem.

For predictions involving latent field components that were not part of the original model specification, the code would need to somehow determine how to automatically extend the definition domain and precision matrix structure for new nodes. This is not a generally well-defined problem!

The practical problem with auto-extending the latent components has two main issues:

  1. There is no universal way to extend general latent models to new "locations". What is natural for one problem may not be the right thing for another.
  2. The inla software sets up the latent model at the estimation stage. Traditionally, prediction/extension/extrapolation on the latent level therefore needs to be done when setting up the initial problem. We do this implicitly with spatial SPDE models: the mesh determines the extent of the spatial model, regardless of where the observations are, which allows full spatial prediction within the pre-defined region. For time series models the same technique can be used: when defining the model component, include the domain for which prediction is needed (in inlabru, specifying a suitable mapper argument can partially automate this).
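Concretely, for the AR1 reprex above, one way to pre-define the extended domain is to declare the full number of groups at component definition time. This is a sketch, untested, and it assumes that extra component arguments such as `ngroup` are forwarded to INLA's `f()`:

```r
library(inlabru)

# Sketch: include the two extra groups in the latent model definition up
# front, even though no observations fall in them. 'ngroup' (an INLA f()
# argument) fixes the number of AR1 group replicates at estimation time,
# so the later prediction for groups 16 and 17 has latent nodes to use.
m_full <- bru(
  y ~ time(x,
           model = "ar1",
           group = z,
           ngroup = n_group + 2,
           constr = FALSE),
  data = df
)
```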

A couple of months ago, support for extending "iid" models to unseen indices was added to inlabru, since for that model there is really only one natural extension, and it's a really easy one to implement. For other models, even those with relatively natural extension, the setup needed is more complicated. A potential future solution is if someone has time to implement a sub-interface to inla for just constructing the latent model information as raw precision matrices. However, there are many features that would cause complications. E.g. for intrinsic random field models, it is extremely common to impose sum-to-zero or integrate-to-zero constraints, and these would also need to be handled (to avoid changing the meaning of the model they should apply only to the original nodes).

TL;DR: the easiest general solution (both in INLA and in inlabru) is to include the entire domain in the original latent model definition, covering both the observations and the domain intended for prediction.

The "prediction gave us a repeated single value" effect is due to how the default mapper information is constructed internally. In the future, this may be tweaked a bit to generate warnings/errors in these situations, to force the user to explicitly ask for that behaviour.

goldingn commented 3 years ago

Thanks Finn, that makes sense and I appreciate the detailed response!

For our purposes, we could try to set up the right simulation for grouped AR1 models (either on our end or as a PR to inlabru). But it may be a lot easier to have the users specify the full set of groups and timesteps in advance. Easier is probably what we should go with for this project.

finnlindgren commented 3 years ago

Even if you only need it for group-ar1 models, a general solution would still need to deal with the whole set of main models. You can likely do it for a specific combination of a main model with group-ar1, but the general case would essentially need to replicate a large chunk of the internal INLA code for setting up the precision matrices, etc. It is much easier to define the initial model to include the prediction domain; this also leaves no questions about the intended model definition. A general solution would potentially involve handling constraints, which is one of the issues with intrinsic stationary random field models; very few people seem to realise the full impact of the intricate theory underpinning those models. But if you don't have any intrinsic stationary models that need extending, you should be fine: a regular Matérn SPDE field is fine, as is ar1, for example, but rw1 and rw2 require more care.

If you have other feature requests/thoughts, please let me know. I will be doing some work on multi-likelihood models later this summer, as well as some other details to allow access to the full range of INLA models. This will involve extending the interface a bit, so it would be good to be aware of potential needs when designing the details of that. I tend to work from an interface design point of view, such as "I'd like to be able to specify this feature in this way, but also provide a coherent interface design that doesn't contradict itself"; the latest CRAN updates were all about changing the internal structure of inlabru to allow more extensions.

finnlindgren commented 3 years ago

A potential route to a general solution to this problem could use the same idea as we've been thinking of for related problems such as sampling from the priors: take a defined inlabru model and adjust aspects of it before calling inla to do the actual computations. This could in principle be done for the "extend the latent models in a specific way" case. Practically, one could first do the estimation using the smaller model, then extend it, and run just the final prediction stage of inla (since the posterior mode would already be known). This may be the easiest way to automate this for general models, but it requires some knowledge of the inlabru and inla model specification and input/output storage formats.

ASeatonSpatial commented 2 months ago

This topic has come up for me when writing a linear mixed model example for some inlabru materials. In the example we have observed groups 1:5 with a simple iid random effect model, and I want to include a way to predict for unobserved group 6. One option is to sample the posterior hyperparameters for the effect and generate posterior samples "by hand", as it were.

Or, perhaps more easily, I could recommend augmenting the data used for fitting with rows whose response is NA, corresponding to the unobserved prediction groups. If you do this you get access to the INLA marginals for the unobserved groups, and you can sample them using generate() as though they had been observed. This seems cleaner to me than sampling hyperparameters and going from there. It's straightforward for the iid or ar1 model, but I can see it getting complicated very quickly.
Or, perhaps this is easier, I could recommend augmenting the data used for fitting by including rows with NA response data corresponding to the unobserved prediction groups. If you do this then you get access to the INLA marginals for the unobserved groups, and you can sample them using generate() as though they had been observed. This seems cleaner to me than samping hyper parameters and going from there, it's straightforward for the iid or ar1 model but I can see it getting complicated very quickly.