inlabru-org / inlabru

inlabru
https://inlabru-org.github.io/inlabru/
76 stars 21 forks source link

Fitting lgcp: error "arguments imply differing number of rows" #118

Closed gaiteiro closed 3 years ago

gaiteiro commented 3 years ago

Hi,

I'm running into a error message fitting a model with certain covariates, but not with others. Here is the error:

fit1 <- lgcp(comp1, data = grid2, samplers = boundary2, domain = list(coordinates = Mesh6)) Error in data.frame(x = integ$loc[, 1], y = integ$loc[, 2], weight = integ$weight, : arguments imply differing number of rows: 0, 1 traceback() 7: stop(gettextf("arguments imply differing number of rows: %s", paste(unique(nrows), collapse = ", ")), domain = NA) 6: data.frame(x = integ$loc[, 1], y = integ$loc[, 2], weight = integ$weight, group = g) 5: bru_int_polygon(domain, poly_segm, method = int.args$method, nsub = int.args$nsub2) 4: ipoints(samplers, domain$coordinates, group = samp.dim, int.args = int.args) 3: ipmaker(samplers = samplers, domain = domain, dnames = response, int.args = options[["bru_int_args"]]) 2: like(family = "cp", formula = formula, data = data, samplers = samplers, E = E, ips = ips, domain = domain, options = options) 1: lgcp(comp1, data = grid2, samplers = boundary2, domain = list(coordinates = Mesh6))

...and here is the code that gave rise to the error, (Windows 10, R4.1.0, Inlabru 2.3.1) along with some code that didn't throw an error:

require(INLA)
require(rgdal)
require(rgeos)
library(inlabru)
library(sp)
library(raster)
library(spdep)
# read in data
grid2 <- readOGR(paste(getwd(), "/31K_envs_clipped_cop95c.shp", sep = "")) 
# change m to km scale
grid2 <- spTransform(grid2, "+proj=utm +zone=54 +south +ellps=WGS72 +towgs84=0,0,1.9,0,0,0.814,-0.38 +units=km +no_defs")
# read in boundary
boundary2 <- readOGR(paste(getwd(), "/NG_boundary_clipped_byecoreg_UTM_snglprt.shp", sep = ""))
boundary2 <- spTransform(boundary2, "+proj=utm +zone=54 +south +ellps=WGS72 +towgs84=0,0,1.9,0,0,0.814,-0.38 +units=km +no_defs")
# remove records with NAs
require(spatialEco)
grid2 <- sp.na.omit(grid2, col.name = "cost")
grid2 <- sp.na.omit(grid2, col.name = "cec")
grid2 <- sp.na.omit(grid2, col.name = "temp10")
grid2 <- sp.na.omit(grid2, col.name = "bio7")
# scale variables
grid2@data[c("cost", "cec", "temp10", "therm")] <- scale(grid2@data[c("cost", "cec", "temp10", "therm")])
grid2@data[c("tri", "fpar", "lai", "soc0_5")] <- scale(grid2@data[c("tri", "fpar", "lai", "soc0_5")])
grid2@data[c("sq3", "sq5", "bio7", "dem")] <- scale(grid2@data[c("sq3", "sq5", "bio7", "dem")])
grid2@data$cop95_1 <- as.factor(grid2@data$cop95_1)
Mesh6 <- inla.mesh.2d(loc=grid2, boundary=boundary2, max.edge = c(25, 250), cutoff = 10,
                      crs="+proj=utm +zone=54 +south +ellps=WGS72 +units=km +no_defs")
matern3 <- inla.spde2.pcmatern(Mesh6,
                               prior.sigma = c(1, 0.01),
                               prior.range = c(25, 0.01))

# this works:
comp2 <- coordinates ~
  cost(cost, model = "linear") +
  mySmooth(coordinates, model = matern3) + Intercept(1)
fit1 <- lgcp(comp2, data = grid2, samplers = boundary2, domain = list(coordinates = Mesh6))

# this formula leads to an error when calling lgcp with the code above:
comp1 <- coordinates ~
  tri(tri, model = "linear") +
  mySmooth(coordinates, model = matern3) + Intercept(1)
# so does this
comp1 <- coordinates ~
  cop95_1(cop95_1, model = "factor_contrast") + Intercept(1)
# and this
comp1 <- coordinates ~
  cop95_1(cop95_1, model = "factor_full") - 1

...in fact the regression produces results for any combination of a dozen or so covariates and/or a spatial term, but I get an error when 'tri' or the factor 'cop95_1' are introduced. All covariates have 27335 observations with coordinates; all covariates except the factor are numeric:

summary(grid2@data$tri) Min. 1st Qu. Median Mean 3rd Qu. Max. -1.2830 -0.8967 -0.1923 0.0000 0.7430 5.2999 sum(is.na(grid2@data$tri)) [1] 0

...I attach the two shapefiles FYI. Best regards, Michael

shapefiles.zip

ASeatonSpatial commented 3 years ago

Is grid2 a spatial points data frame? I think it should be points right, if you are using lgcp? But the covariates need to be defined everywhere in the spatial domain. Suspect this is the issue.

ASeatonSpatial commented 3 years ago

I think having a look at this might help you:

https://inlabru-org.github.io/inlabru/articles/web/2d_lgcp_covars.html

You need to specify a function to look up continuous covariates or use the sp object directly in the model component definition. It looks like you are passing the covariates as data but that should be the point pattern object.

finnlindgren commented 3 years ago

The covariates should be allowed to be SpatialGridDataFrame or SpatialPixelsDataFrame objects, which should trigger spatial evaluation of them; that function is built-in so the user doesn't need to build their own function for that. But if they are other types of objects, an evaluator function is needed. However, it seems you store all the cvariates in grid2, and just have different column names for each of them? Then the syntax

~ cost(cost, model = "linear")

is incorrect. Instead, you need

~ cost(grid2, main_layer = "cost", model = "linear")

Also, you need to supply the observed point pattern as data to lgcp(), not the covariate grid grid2.

gaiteiro commented 3 years ago

Thanks both! grid2 was a spatial points data frame, the covariates were in the attribute table of the shapefile, after loading the .shp file into R the covariates are grid2@data. However I got the same error without trying to include covariates:

comp2 <- coordinates ~ 
  mySmooth(coordinates, model = matern3) + Intercept(1)
fit1 <- lgcp(comp2, data = grid2, samplers = boundary2, domain = list(coordinates = Mesh6))

I must admit I'm not sure how to supply the spatial points as data, I tried this:

coords <- coordinates(grid2)
comp2 <- coords ~  cost(grid2, main_layer = "cost", model = "linear") 
fit1 <- lgcp(comp2, data = coords, samplers = boundary2, domain = list(coordinates = Mesh6))

...but got the same error. And grid2@coords returns a matrix so that wouldn't work ...sorry this probably seems a bit elementary.

Looking at the gorillas example, I see that 'nests' is a spatial points data frame and e.g. gcov$vegetation is a spatial pixels data frame. So maybe I could have imported my covariate rasters and referred to them in the formula. However, I was hoping that by first extracting the point values of the rasters to the points in grid2, this would allow me to subset grid2 to remove any points with underlying covariate values as NA.

finnlindgren commented 3 years ago

The observation part of a point process is the combination of a set of observed points, and the region where observations could occur. The lgcp() function builds the inla likelihood approximation for such a model, and to do that, it needs to evaluate the covariates on the entire observation region, not just at the location of the points. So unless you do your own likelihood construction (e.g. by manually constructing the integral construction for the likelihood and calling bru() directly), you must supply the whole spatial field of covariate values; inlabru needs the entire covariate raster as component input to be able to evaluate the covariates at the internally constructed integration points. So the problem is your last sentence; pre-extracting the covariate values at only the locations given by the observed point pattern is the opposite of what's needed. To properly define the point process likelihood, the covariate must be non-missing throughout the observation region (or filled-in, either with basic nearest-value infill which is done automatically or with a direct call to bru_fill_missing(), or with an explicit joint model for the covariate field itself).

finnlindgren commented 3 years ago

If you have large connected regions with missing coordinate values, you might consider excluding those from the definition of the observation region (samplers), but only do that if you're confident that the missingness is independent of the process itself, since otherwise it changes the model interpretation.

gaiteiro commented 3 years ago

Oh I see - I was way off there...thanks for setting me on the right track.

I have some missing values because different rasters and data sources have differing ideas of where the land / sea boundary is. The percentage is tiny but I wasn't sure if lgcp would handle the NAs.

finnlindgren commented 3 years ago

OK, for such small boundary-missing-values, the bru_fill_missing method should suffice; it uses the nearest available value. (And it was introduced specifically to handle that type of missingness.)

gaiteiro commented 3 years ago

Great! I'll try that...

finnlindgren commented 3 years ago

If/when you sort it out, please post a follow-up on the r-inla list thread about it!

ASeatonSpatial commented 3 years ago

@finnlindgren wonder if to avoid confusion like this lgcp could accept two data objects? One for points and one for covariates?

gaiteiro commented 3 years ago

Yes I'd better post the code if I can sort it out..otherwise my original post will mystify any readers!

finnlindgren commented 3 years ago

I got an email notification of a follow-up message, but it doesn't appear here; short story, yes, if you hadn't included any covariate terms, the model should run, given that grid2 was a SpatialPointsDataFrame containing the observed point pattern.