inlabru-org / inlabru
https://inlabru-org.github.io/inlabru/

Improve documentation of prediction features, was: "Error when using data.frame instead of SpatialPoints* on lgcp" #140

Closed. silasprincipe closed this issue 1 year ago.

silasprincipe commented 2 years ago

Hi! I'm using inlabru to predict the distribution of species. Because I also want to predict from the model under a new (future) scenario, I need to fit the model using a data.frame as input instead of SpatialPoints (so that inlabru will be able to connect the new data coordinates to the SPDE, right?).

I was running a test using the gorillas example, but converting the inputs to data.frame, which throws an error:

library(inlabru)

# Load the Gorilla data
data(gorillas, package = "inlabru")

# Define SPDE prior
matern <- INLA::inla.spde2.pcmatern(gorillas$mesh,
                                    prior.sigma = c(0.1, 0.01),
                                    prior.range = c(0.01, 0.01)
)

# Convert to data.frame
ndat <- as.data.frame(gorillas$nests)

cmp <- coordinates ~ mySmooth(coordinates, model = matern) + Intercept(1)

fit <- lgcp(cmp, data = ndat,
            samplers = gorillas$boundary,
            domain = list(coordinates = gorillas$mesh),
            options = list(control.inla = list(int.strategy = "eb"))
)

Error in setdiff(names(data), data_coordnames) :
  object 'data_coordnames' not found

If I supply the integration points as a data.frame, then it works:

ipt <- ipoints(gorillas$boundary, gorillas$mesh)
ipt <- as.data.frame(ipt)

# Fit using the new approach
fit1 <- lgcp(cmp, data = ndat,
            ips = ipt,
            domain = list(coordinates = gorillas$mesh),
            options = list(control.inla = list(int.strategy = "eb"))
)

# Fit using standard approach
fit2 <- lgcp(cmp, gorillas$nests,
            samplers = gorillas$boundary,
            domain = list(coordinates = gorillas$mesh),
            options = list(control.inla = list(int.strategy = "eb"))
)

# Same results!

I suspect this is a problem with the like() function, between lines 92 and 97:

if (ips_is_Spatial) {
  non_coordnames <- setdiff(names(data), data_coordnames)
  data <- sp::SpatialPointsDataFrame(
    coords = data[new_coordnames],
    data = data[non_coordnames],
    proj4string = data_crs,
    match.ID = FALSE
  )
}

When the data is supplied as a data.frame, the coordinate names are never assigned to the variable data_coordnames, which then causes the error.

finnlindgren commented 2 years ago

Can you let me know where you found a suggestion that you would need to turn the Spatial object into a data.frame for this? It's the complete opposite!

When using coordinates as input to a component, inlabru will internally call the sp::coordinates() function on the data object, which only works if the input is a Spatial object.

For non-spatial objects, such as plain data.frame objects, you need to explicitly use the vectors containing the coordinates as component inputs. However, for lgcp models (and family="cp" models in general) there is currently no way to specify a pair of coordinate columns to match to a spatial domain and also automate the integration points with the domain argument, so for now the easiest approach is to use Spatial input data instead.
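
A minimal sketch of the two input styles, reusing the matern and gorillas objects from the first post; the data.frame variant (with hypothetical columns x, y and a count response) illustrates the non-"cp" case only:

# Recommended for lgcp / family = "cp": keep the data as a Spatial object,
# so that sp::coordinates() can be applied to it internally.
cmp_sp <- coordinates ~ mySmooth(coordinates, model = matern) + Intercept(1)
fit_sp <- lgcp(cmp_sp,
               data = gorillas$nests,           # SpatialPointsDataFrame
               samplers = gorillas$boundary,
               domain = list(coordinates = gorillas$mesh))

# For non-"cp" likelihoods a plain data.frame works by pointing the component
# at explicit coordinate columns; `x`, `y` and `count` are hypothetical
# column names in `dat`.
# cmp_df <- count ~ mySmooth(cbind(x, y), model = matern) + Intercept(1)
# fit_df <- bru(cmp_df, family = "poisson", data = dat)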

silasprincipe commented 2 years ago

Hi Finn! Sorry about that, I think I misunderstood everything... I arrived at this 'solution' after reading the help for lgcp (which says that you can supply the data as a data.frame) and digging through the INLA discussion group (multiple questions, not exactly about lgcp). From that I assumed that if I was to predict on new data, the easiest way was to supply it as a data.frame... bad call, sorry again!

But if the only way is to use a Spatial object, how can I ask the predict function to use a new set of covariate values when predicting?

I tested with a SpatialPoints object but the outcome was the same, even with a different set of values. I'm probably missing something.

# Fit the covariate example
comp1 <- coordinates ~ vegetation(gcov$vegetation, model = "factor_full") - 1
fit1 <- lgcp(comp1, nests, samplers = boundary, domain = list(coordinates = mesh))

df <- pixels(mesh, mask = boundary)
pred1 <- predict(fit1, df, ~exp(vegetation))

ggplot() + gg(pred1)

# Modify the vegetation covariate
veg2 <- gcov$vegetation
# Primary is where the intensity is highest, so just replace Grassland with Primary
veg2$vegetation[veg2$vegetation == "Grassland"] <- "Primary"
# Convert to SpatialPointsDataFrame (following the predict.bru help)
veg2 <- as(veg2, "SpatialPointsDataFrame")

# Predict and plot --- same results as with the original covariate...
pred2 <- predict(fit1, veg2, ~exp(vegetation))

ggplot() + gg(pred2, aes(color = mean), size = 3)

finnlindgren commented 2 years ago

predict() will evaluate the component inputs using the same mechanism as the estimation code (i.e. it evaluates gcov$vegetation at the specified coordinates). So if you want to do "counterfactual" prediction, or use specific covariate values for each location instead of extracting them from the component input covariate specification, you can use vegetation_eval(), which takes arbitrary input values: for example, vegetation_eval("Primary"), or, if you have a column myveg in the prediction SpatialPointsDataFrame object, vegetation_eval(myveg).
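
As a small sketch of the usage described above (fit1 and df are from the previous comment; the myveg factor column is assumed to already exist in df):

# Counterfactual prediction with one fixed vegetation level everywhere:
pred_fixed <- predict(fit1, df, ~ exp(vegetation_eval("Primary")))

# Per-location values taken from a (hypothetical) column `myveg` of the
# prediction data:
pred_col <- predict(fit1, df, ~ exp(vegetation_eval(myveg)))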

silasprincipe commented 2 years ago

Hi Finn! Now I get it, many thanks! It is a really simple solution, very functional. I will just leave the code at the end, in case someone has the same question in the future.

Can I suggest adding an example, or just an extra explanation, about this behaviour to the documentation? Usually those working with species distribution modelling (or similar) will predict onto rasters with different values (e.g. temperature today, temperature in 2100), and the prediction system in INLA is sometimes harder to understand. This solution in inlabru, on the other hand, is quite straightforward once you get it!

Thanks again.

# Fit as usual
comp1 <- coordinates ~ vegetation(gcov$vegetation, model = "factor_full") - 1
fit1 <- lgcp(comp1, nests, samplers = boundary, domain = list(coordinates = mesh))

# Predict on the grid
df <- pixels(mesh, mask = boundary)
pred1 <- predict(fit1, df, ~exp(vegetation))

# If we modify the object that is called on the formula and predict
# the model again, it will now use the new object.
# Make a dumb modification:
gcov$vegetation$vegetation[gcov$vegetation$vegetation == "Grassland"] <- "Primary"
pred2 <- predict(fit1, df, ~exp(vegetation))

# Plot - it works.
p1 <- ggplot() + gg(pred1)
p2 <- ggplot() + gg(pred2)
multiplot(p1, p2)

finnlindgren commented 2 years ago

OK, good! Yes, we need an updated and simplified vignette showing these principles; the existing vignettes were written before all these features were implemented... We also need to simplify the case of the user not wanting to modify the covariate grid, but instead supplying an entirely new version. To do this, the user can use the _eval feature, but we need to make the grid evaluator method more visible (currently it might not even be exported and is only internal). I expect us to revisit that as we work on the sf/terra/stars support.
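
Until such a grid evaluator is exported, one possible sketch of the "entirely new covariate grid" case is to extract the new grid's values at the prediction locations yourself and feed them through the _eval function. Here veg_future is a hypothetical replacement SpatialPixelsDataFrame with the same factor levels as gcov$vegetation:

df <- pixels(mesh, mask = boundary)
# Extract the new covariate at the prediction locations...
df$veg_new <- sp::over(df, veg_future)$vegetation
# ...and pass those values through the _eval function, leaving
# gcov$vegetation (the object named in the component) untouched.
pred_future <- predict(fit1, df, ~ exp(vegetation_eval(veg_new)))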

finnlindgren commented 1 year ago

The family="cp" code and all of the spatial code have received a big overhaul lately, including support for sf and terra inputs in many places (pixels() and some other functions can still only produce Spatial* output), so I'm closing this issue.
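
For readers landing here later, a rough sketch of what the sf-based input can look like in a recent inlabru version; the st_as_sf conversions and the geometry-named component input are assumptions about the newer interface, not code from this thread:

library(inlabru)
library(sf)

data(gorillas, package = "inlabru")
nests_sf <- st_as_sf(gorillas$nests)
boundary_sf <- st_as_sf(gorillas$boundary)

matern <- INLA::inla.spde2.pcmatern(gorillas$mesh,
                                    prior.sigma = c(0.1, 0.01),
                                    prior.range = c(0.01, 0.01))

# With sf data, the point locations are carried by the `geometry` column:
cmp_sf <- geometry ~ mySmooth(geometry, model = matern) + Intercept(1)
fit_sf <- lgcp(cmp_sf, data = nests_sf,
               samplers = boundary_sf,
               domain = list(geometry = gorillas$mesh))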

finnlindgren commented 1 year ago

The vignettes still need some updating, but there is now a component vignette that discusses some of these issues.