inlabru-org/inlabru (https://inlabru-org.github.io/inlabru/)

Joint modeling of point and areal data #125

Open bradleyswilson opened 2 years ago

bradleyswilson commented 2 years ago

Greetings, I've been working with INLA for joint modeling of point and areal data (similar to the model in this paper).

I'm looking to transfer the model over to inlabru to help handle large numbers of predictions, but I'm struggling to figure out how to implement the model specification.

Is the preferred approach to use the like() and bru() functionality to specify two likelihoods, one for the point data and another for the polygons? When I pass spatial polygons directly to a field(polygons, model = matern) component, I get a cannot coerce class ‘structure("Polygons", package = "sp")’ to a data.frame error. I'm looking for something constructed similarly to inla.spde.make.block.A from the regular INLA package.

finnlindgren commented 2 years ago

Automated support for a field(polygons, ...) syntax (or something equivalent) for count data hasn't been implemented yet; usually the integration should be done on the natural scale, not the log scale, and for the combined predictor, so the component specification is usually not the appropriate place for it anyway. In principle, it should become similar to the model="cp" syntax, with a samplers specification for the polygons. For now, it can be done with some manual intervention, in several ways; I'll sketch one below using inlabru::ipoints, but a PhD student is working on this more generally. For multiple likelihoods, yes, the idea is to define the components jointly and then specify each likelihood with a separate like() call.

The basic ipoints() result for polygons is an integration scheme for the combination of all the polygons. To keep them separated, the group option needs to be specified; below I assume that polygons is a SpatialPolygonsDataFrame whose data.frame part has a variable polygonID (there may be other ways):

ips <- ipoints(samplers=polygons, domain=mesh, group="polygonID")

Then we need a way to apply this to the predictor.

A <- inla.spde.make.A(
  mesh,
  coordinates(ips),
  block = ips$polygonID,    # sum rows belonging to the same polygon
  block.rescale = "none",   # keep the raw integration weights
  weights = ips$weight)

Then the like() call would look something like this:

like(formula = response ~ log(A %*% exp(here + goes + your + predictor + expression)),
     family = "poisson",
     data = ips,
     allow_combine = TRUE)

Any non-spatial covariates would need to be added to ips, and any spatial covariates handled via covar(griddataframe, model = "linear") or covar(covarevaluatorfunction, model = "linear"); see the examples at https://inlabru-org.github.io/inlabru/articles/web/2d_lgcp_covars.html
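
As a sketch of the evaluator-function option (hypothetical: elev_grid is a SpatialGridDataFrame with an elevation column; neither name is from this thread), such a function just maps coordinates to covariate values:

eval_covar <- function(xy) {
  # Look up the covariate value at each coordinate pair via spatial overlay
  pts <- sp::SpatialPoints(xy, proj4string = sp::CRS(sp::proj4string(elev_grid)))
  sp::over(pts, elev_grid)$elevation
}
# Component usage: covar(eval_covar(cbind(easting, northing)), model = "linear")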

I'm sure there are more details to work out depending on your model, but the principle is given above. Apart from internal, as-yet-unpublished projects, we haven't really shown how to do log(A %*% exp(eta)) models yet; existing examples in the literature tend to use A %*% eta instead, which is cheaper (and keeps the familiar log-linear predictor structure) but not strictly a correct approximation of the aggregation of Poisson count models.
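
A toy computation of the difference, for a single polygon with two integration points (values made up):

eta <- c(1, 3)           # log-intensity at two integration points
w   <- c(0.5, 0.5)       # integration weights
log(sum(w * exp(eta)))   # exact aggregation, log(A %*% exp(eta)): about 2.43
sum(w * eta)             # log-linear approximation, A %*% eta: 2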

There's also a snag with the construction above: I'm almost certain it doesn't allow you to specify the response variable, since it has a different length than ips. For that, I believe you currently need to give the input data as a list, which I just realised breaks some of the automated coordinate handling. If you don't have any spatial covariates, or you write all covariate access via functions instead of SpatialGridDataFrame objects, then you can fix it by using

data = c(list(response = data$response), as.list(as.data.frame(ips)))

and using explicit cbind(easting, northing) in your component input specifications instead of the semi-automated coordinates approach. Note that this also works for the other likelihoods; the coordinates approach is just a sometimes-convenient shorthand.

This should become a bit easier to deal with once I add a separate response variable argument to like(), similarly to how inla.stack() already handles that; the "right-hand-side" input data can then stay a spatial object.

finnlindgren commented 2 years ago

Additional note: proper count-model aggregation is only well-defined for the Poisson distribution, since Po(lambda1) + Po(lambda2) ~ Po(lambda1 + lambda2). For other models, including the Binomial with different probabilities, this doesn't hold, so there's no longer a strong theoretical argument that log(A exp(eta)) is more correct than A eta; instead, working out a sensible approach is a research problem.
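
A quick simulation check of that additivity (illustration only):

set.seed(1)
x <- rpois(1e5, lambda = 2) + rpois(1e5, lambda = 3)
c(mean = mean(x), var = var(x))  # both close to 5, consistent with Po(5)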

finnlindgren commented 2 years ago

I expect to wrap the basic version of this into a like() model soon, as it works almost identically to how model="cp" (which the convenience wrapper lgcp() uses) is implemented; that would also allow it to get the response variable information from the SpatialPolygonsDataFrame instead.

bradleyswilson commented 2 years ago

Thanks for the quick and detailed response. I'm working with Gaussian observations rather than a count model aggregation. If I'm understanding the approach correctly, my version would look something like:

ips <- ipoints(samplers=polygons, domain=mesh, group="polygonID")

A <- inla.spde.make.A(
  mesh,
  coordinates(ips),
  block=ips$polygonID,
  block.rescale = "sum")

like(formula = response ~ A %*% (my + predictor + expression),
     family = "gaussian",
     data = ips,
     allow_combine = TRUE)

I then define a second like() for the point values, something like:

like(formula = response ~ my + predictor + expression + field(pt_coordinates, model = matern),
     family = "gaussian",
     data = pts)

and combine them with bru(jcmp, lik1, lik2)

finnlindgren commented 2 years ago

Yes, for Gaussian observations it's also fine! You'll still need to have the component specification separate from the formulas, and use the list data construction for the aggregating model; something like

comp <- ~ Intercept1(1) + Intercept2(1) +
  covariate(eval_covar(cbind(easting, northing)), model = "linear") +
  field(pt_coordinates, model = matern)
like1 <-
  like(formula = response1 ~ A %*% (Intercept1 + covariate),
       family = "gaussian",
       data = c(list(response1 = data$response1), as.list(as.data.frame(ips))),
       allow_combine = TRUE)
like2 <-
  like(formula = response2 ~ .,
       include = c("Intercept2", "field"),
       family = "gaussian",
       data = pts)
bru(comp, like1, like2)

For this to work, the coordinate names in pts and polygons need to be set to known values (replace easting and northing with whatever you have, or rename your coordinates to match). The inlabru devel version is meant to preserve those names when needed and possible (as in ipoints(); use the name argument to override existing names).
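
For example, sp coordinate names can be changed with coordnames<- (a sketch, assuming pts and polygons are sp objects):

sp::coordnames(pts) <- c("easting", "northing")
sp::coordnames(polygons) <- c("easting", "northing")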

bradleyswilson commented 2 years ago

In testing a simple example, I can't seem to get around a differing number of rows error when trying to index the spde object for the polygons: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 14271, 17369

For reference, in my existing INLA code I'm doing something like this, with different A matrices for points and polygons:

s.index <- inla.spde.make.index(name="spatial.field", n.spde=spde$n.spde)
# data and poly_data are data.frames
stack.pts <- inla.stack(
                        data=list(response = data$response),
                        A=list(A.pts,1),
                        effects=list(c(s.index, list(Intercept=1)),
                                      list(covar = data$covar)))

stack.poly <- inla.stack(
                        data=list(response = poly_data$response),
                        A=list(A.poly,1),
                        effects=list(c(s.index, list(Intercept=1)),
                                      list(covar = poly_data$covar)))

formula <- response ~ Intercept + f(spatial.field, model = spde)

The A matrix you've suggested for the aggregation is identical to the one I'm already using, so I'm not sure where the issue is arising. Manually running A %*% spatial.field runs just fine. Do I need a custom mapper to access the spde object with a list of indices rather than coordinate pairs?

comp <- ~ Intercept1(1) + field(spatial.field, model=spde)
like1 <-
  like(formula = response ~ A %*% (Intercept1 + field),
       family = "gaussian",
       data = c(list(response = data$response), s.index),
       allow_combine = TRUE)
fit <- bru(comp, like1)
finnlindgren commented 2 years ago

Aha, I see the problem; I mixed two different approaches.

In your Gaussian/linear case, you can either change the A matrix to handle it correctly, or set the field component mapper to bru_mapper_index(spde$n.spde). The downside of the latter approach is that it affects both likelihoods, so the other one would then also need an A matrix. Correcting the A matrix is therefore the generally better method here, I think.
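
A sketch of the mapper alternative (assuming the spde object from your code; the mapper argument changes how the field inputs are interpreted in every likelihood):

comp_alt <- ~ Intercept1(1) +
  field(spatial.field, model = spde, mapper = bru_mapper_index(spde$n.spde))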

In the nonlinear (Poisson) case, the correct solution is to change the A matrix. Instead of inla.spde.make.A, it should actually be a much simpler matrix, in which each row contains the integration point weights for the corresponding "block".

I think this should do it:

A <- Matrix::sparseMatrix(
  i = ips$polygonid,
  j = seq_len(NROW(ips)),
  x = ips$weight,
  dims = c(max(ips$polygonid), NROW(ips)))
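
As a quick sanity check of the construction (toy values, not from your data), each row of A then holds the weights of the integration points in one polygon:

library(Matrix)
ips_toy <- data.frame(polygonid = c(1, 1, 2, 2), weight = c(0.1, 0.4, 0.3, 0.2))
A_toy <- sparseMatrix(
  i = ips_toy$polygonid,
  j = seq_len(nrow(ips_toy)),
  x = ips_toy$weight,
  dims = c(max(ips_toy$polygonid), nrow(ips_toy)))
dim(A_toy)      # 2 x 4: one row per polygon, one column per integration point
rowSums(A_toy)  # total integration weight per polygon: 0.5 0.5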


bradleyswilson commented 2 years ago

No luck. Still getting the same error (using both approaches), just with different numbers of rows.

finnlindgren commented 2 years ago

Can you send a reproducible example I can check/debug?

finnlindgren commented 2 years ago

The corrected A matrix can also be constructed via

INLA::inla.spde.make.block.A(
  inla.as.dgTMatrix(Diagonal(NROW(ips), ips$weight)),
  block = ips$polygonid,
  rescale = ...)

which allows choosing an optional rescaling method ("none" for plain integration or "sum" for spatial averaging).
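
Illustratively (toy values, not from your data), the two options differ by a row rescaling:

library(Matrix)
A_int <- sparseMatrix(i = c(1, 1, 2, 2), j = 1:4, x = c(0.1, 0.4, 0.3, 0.2))
A_int                                       # rescale = "none": rows integrate over each polygon
Diagonal(x = 1 / rowSums(A_int)) %*% A_int  # rescale = "sum": rows average over each polygon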

bradleyswilson commented 2 years ago

Sure, this shows the same error on the gorillas data.


polygons <- gorillas$plotsample$plots
polygons@data <- tibble::rownames_to_column(polygons@data) # add group
mesh <- gorillas$mesh
spde <- inla.spde2.matern(mesh=mesh)

ips <- ipoints(samplers=polygons, domain=mesh, group="rowname")
# ipoints drops the grouping variable, adding back in
ips@data <- cbind(ips@data, polygonid = floor(as.numeric(tibble::rownames_to_column(ips@data)$rowname)))

A <- inla.spde.make.block.A(Diagonal(nrow(ips), ips$weight), block=ips$polygonid, rescale="sum")

comp <- ~ Intercept1(1) + field(cbind(x,y), model = spde)
data$response <- rnorm(nrow(polygons), sd = 0.4) 

like1 <-
  like(formula = response ~ A %*% (Intercept1 + field),
       family = "gaussian",
       data = c(list(response = data$response), as.list(as.data.frame(ips))),
       allow_combine = TRUE)

fit <- bru(comp, like1)
finnlindgren commented 2 years ago

I'm not sure precisely what error message you get, or from what part of the code. I made a few corrections to the code: e.g. inla.spde.make.block.A requires a dgTMatrix (that function was meant to be used only internally, but it should really check the input instead of failing). After that I only got a result[,k] <- result_ size mismatch on the final line; that was because the predictor result is a Matrix, which wasn't handled correctly by the internal code. An easy workaround was to wrap it in as.vector(). I'll review whether that's also the correct fix for the general internal code and fix it there later. With that fix, the example below runs (the Hessian warning makes sense for the IID example response data).
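
A minimal illustration (separate from the example below) of the Matrix behaviour that as.vector() works around:

library(Matrix)
m <- Matrix(1:4, 2, 2)
class(m %*% c(1, 1))             # "dgeMatrix": a Matrix object, not a plain vector
class(as.vector(m %*% c(1, 1)))  # "numeric"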

library(INLA)
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loading required package: parallel
#> Loading required package: sp
#> This is INLA_21.11.16 built 2021-11-16 14:00:50 UTC.
#>  - See www.r-inla.org/contact-us for how to get help.
#>  - To enable PARDISO sparse library; see inla.pardiso()
library(inlabru)
data("gorillas", package="inlabru")

polygons <- gorillas$plotsample$plots
polygons@data <- tibble::rownames_to_column(polygons@data) # add group
mesh <- gorillas$mesh
spde <- inla.spde2.matern(mesh=mesh)

ips <- ipoints(samplers=polygons, domain=mesh, group="rowname")
#> (repeated showSRID warnings omitted: "Discarded ellps unknown in Proj4
#> definition" and "Discarded datum unknown in Proj4 definition")
# The rowname column is still there. Convert to numeric.
# head(ips)
# coordinates rowname       weight
# 1   (581.9648, 674.5587)       1 0.0006067968
ips@data$polygonid <- as.numeric(ips@data$rowname)

A <- inla.spde.make.block.A(
  inla.as.dgTMatrix(Diagonal(nrow(ips), ips$weight)),
  block=ips$polygonid,
  rescale="sum")

comp <- ~ Intercept1(1) + field(cbind(x,y), model = spde)
data <- data.frame(response = rnorm(nrow(polygons), sd = 0.4))

like1 <-
  like(formula = response ~ as.vector(A %*% (Intercept1 + field)),
       family = "gaussian",
       data = c(list(response = data$response), as.list(as.data.frame(ips))),
       allow_combine = TRUE)

fit <- bru(comp, like1)

Created on 2021-11-17 by the reprex package (v2.0.1)

bradleyswilson commented 2 years ago

The as.vector() appears to have fixed the mismatch error I was getting, at least in this example! I'll report back when I've attempted spinning up the full model. I really appreciate the troubleshooting help.

finnlindgren commented 1 year ago

This is getting closer to having full support in the package. In particular, there are mappers that can be used in the predictor expression for the aggregations. For examples in the unit tests, see https://github.com/inlabru-org/inlabru/blob/devel/tests/testthat/test-aggregate.R

nickschurch commented 4 months ago

@bradleyswilson did you figure out how to do this in the end? I'm new to inlabru but trying something similar with two point datasets and one areal gridded dataset and keep hitting blocks...

bradleyswilson commented 4 months ago

@nickschurch I believe I went back to using INLA syntax instead of inlabru for this particular instance, although I haven't tested the updates Finn has made since then.

My issue was mostly computational, which I solved another way with mesh projections in a custom tiling scheme.

ASeatonSpatial commented 2 months ago

@bradleyswilson I have just scanned through this but I believe this is now supported in the bru mapper system with the logsumexp mapper. We have included an example in the software paper that we are currently drafting (very nearly finished!). I can't share the data here but here is a sneak peek at the code:

# Define Matern/SPDE model
matern <- inla.spde2.pcmatern(
  mesh,
  prior.range = c(2, 0.1),
  prior.sigma = c(3, 0.1)
)

# Define the intermediate zone integration scheme
ips <- fm_int(mesh, samplers = glasgow_iz)
agg <- bru_mapper_logsumexp(
  rescale = FALSE,
  n_block = nrow(glasgow_iz)
)

# Define model components and formula
cmp <- ~ 0 + beta(1) + xi(main = geometry, model = matern)
fml <- count ~ ibm_eval(
  agg,
  input = list(weights = weight, block = .block),
  state = beta + xi
)

# Construct the likelihood
lik <- like(
  formula = fml,
  family = "poisson",
  data = ips,
  response_data = glasgow_iz
)

# Fit the model
fit <- bru(
  components = cmp,
  lik
)

Here glasgow_iz is an sf polygons object. The mappers have methods associated with them; one of these is ibm_eval, which evaluates the mapper for a given set of inputs. In this case the mapper is specifically designed to integrate exp(predictor) and take the log of the result (so it's designed with the Poisson likelihood and log link in mind). As far as I am aware, in INLA all you can do is sum the log rate within each polygon, so this should hopefully be an improvement.
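
To make the aggregation concrete, here is a toy version of what (as I understand it) the mapper evaluates for each block, with rescale = FALSE (values made up):

eta <- c(0.2, 1.5, -0.3, 0.8)  # predictor values at four integration points
w   <- c(0.1, 0.4, 0.3, 0.2)   # integration weights
blk <- c(1, 1, 2, 2)           # polygon (block) index for each point
# Per-block aggregated predictor: log of the weighted integral of exp(predictor)
tapply(w * exp(eta), blk, function(v) log(sum(v)))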