Open MGousseff opened 8 months ago
Could you perhaps provide a minimal reproducible example? Might something in these lecture notes help https://rsbivand.github.io/PG_AGII_2sem/11_talk.html ?
Thank you very much for the link. I will try to produce minimal data for a reproducible example as soon as possible, as the example you linked to seems to be the same situation. I'll try it as soon as possible. I still have to see what is the best way to provide the minimal data and how to look them here.
Broke my hand so coding is slow. Still trying to understand my bug and produce minimal example, sorry for the delay
Dear Roger,
I've been running some SAR models with the spatialreg package to analyze raster data in a project which main objective is to improve forest structural models by incorporating data on recent forest disturbance. The SAR models are working fine but I would like to use them for making predictions. Thus I split the data into a training data set (70% of data) and a test data set (30% of data), ran a SAR model using the errorsarlm function with the training data set and use the predict function to predict values in the test data set. Although everything runs, it looks like the predict function ignores the spatial component when using a new data set for making predictions. I was wondering if I'm doing something wrong here and if you have any suggestions for making predictions with SAR models using training and testing datasets. I've used the examples in your lectures for writing this script and I also read the papers you recommended on GitHub Issue #45. Below you will find an example of the script and attached is a sample of the data set I'm working with. The original data set has 4484 observations and 33 variables (including geographic location X and Y). Most of the variables are continuous and comprise information about environmental conditions (topography, soil) and forest disturbance. The variable I'm trying to predict is LAI (stands for leaf area index and it's called lai_harv on data set). Data was originally a raster with several layers that I transformed into a data frame using the centroid of each pixel as X and Y locations (in UTM). Thank you very much for your time and help! Best,
Adriana
lai_df.csv
# R script for Github Issue --making predictions with SAR ----
## Load libraries ----
library(tidyverse)
library(caret)
library(spatialreg)
## Load data ----
### This is a subset of the original data, which is 4484x33
#load("data/L2/lai_df.RData")
read_csv("data/L2/lai_df.csv")
## Data partition ----
### Set seed for random number generation
set.seed(2024)
### Split data
index <- caret::createDataPartition(lai_df$lai_harv,
p= 0.7,
list = F,
times = 1)
### Create train_df and test_df using data partition
train_df <- lai_df[index,]
test_df <- lai_df[-index,]
## Neighborhood matrix ----
loc_dist <- 36.48985
### I'm using 36.48 m as neighborhood distance
### X and Y of full data frame
loc_all <- (lai_df[,2:3])
### X and Y of rain data
loc_train <- (train_df[,2:3])
### Test data
loc_test <- (test_df[,2:3])
# Make list of neighborhoods using loc_dist (min distance between points)
neighbors_dist_all <- spdep::dnearneigh(loc_all, 0, d2 = loc_dist, longlat = F)
neighbors_dist_train <- spdep::dnearneigh(loc_train, 0, d2 = loc_dist, longlat = F)
neighbors_dist_test <- spdep::dnearneigh(loc_test, 0, d2 = loc_dist, longlat = F)
# Spatial weights
sp_all <- spdep::nb2listw(neighbors_dist_all, style = "W", zero.policy = T)
sp_train <- spdep::nb2listw(neighbors_dist_train, style = "W", zero.policy = T)
sp_test <- spdep::nb2listw(neighbors_dist_test, style = "W", zero.policy = T)
## SAR ----
#### Compute SAR model with errorsarlm function (which is the one that works with the predict function)
sar_lai <- errorsarlm(formula = lai_harv ~ dtm_harv + summer_insolation_harv + roughness + ndvi_growMag + ndmi_sdIndex + x_sc + ndvi_2018 + tcw_2018,
data = train_df,
listw = sp_train,
zero.policy = T)
summary(sar_lai)
### Prediction within data set
predict(sar_lai)
### Prediction using the test data set
lai_predictions_sar <- predict(sar_lai, newdata = test_df, listw = sp_all, zero.policy = T, pred.type = "TS")
### what happens with the spatial component of the model?
Fitting a spatial error model may change the fitted regression coefficients compared to least squares, so the prediction will differ. The spatial term applies to the error, which is not observed, so as we saw from email exchange in April, there is no clear path forward. May I add (some of) my email comments to this thread?
Since April, I gave a talk in part about this, slides at: https://rsbivand.github.io/nem24_talk/, sources at https://github.com/rsbivand/nem24_talk, talk recording at: https://nhh.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d26410ee-6243-48ce-96dd-b18400beb764, and see https://github.com/rsbivand/nem24_talk/issues/1.
Dear Roger, Thank you very much for your reply and the resources! Sorry to hear there's no clear path forward but glad to know there's research being developed on the topic. Of course, feel free to add your email comments to this thread. Best,
Adriana
Comments from email in April:
Please see these two recent issues on the spatialreg repository: https://github.com/r-spatial/spatialreg/issues/45, https://github.com/r-spatial/spatialreg/issues/47. Please follow up the references there, and maybe contribute there - preferably #47 which is still open, also consider providing there a minimal reproducible example. Using predictions to test/validation data sets to assess model goodness of fit with spatial data is always highly problematic, as the references in #45 show. You are right that there is no predict() method for spautolm objects. For SAR, use errorsarlm(), for CAR use some other fitting function in another package, mgcv::gam() with an mrf smooth for example, or INLA::inla.
Thank you for the great spatial tools you provide to the community.
Maybe I'm doing something wrong as I'm rather new to spatial regressions, but here is a behavior I find strange.
if I fit a spatial regression model the way it is suggested in the man page, it works and I get some fitted values in the object I get. For instance :
formula1<-paste0( "to_predict ~ predictor_1 + predictor_2") fit.sem1<-lagsarlm(formula1, data=df_b, listw = neighbs_weights_b)
will work. Then my
fit.sem
object has a$fitted.values
and it seems everything works. df_b is an sf object, and neghbs_weights_b have been produced as intended.But when I try to use this model to predict my dependent variable on another location, using something like :
predict.sem2.arras<-predict(object=fit.sem2, newdata = df_a, listw = neighbs_weights_a, pred.type = "TS", legacy.mixed=TRUE, power=TRUE)
I get an error message :
It is a bit confusing, as I thought my data was wrong, like missing one of the predictors. I checked the way the weights are provided, including how the row.names need to be specified so that they are different in the training area and the testing area.
I found, thanks to the code of an article by Thibault Laurent (thank him for making it available), that the proper use seems to be:
All in all, as it is working, maybe it is just that I'm not aware of practice that may seem obvious to the community, and if so, sorry for the inconvenience, but as an R user, it is rather puzzling that feeding the formula "almost" works and feeding the linear object "fully" works. Maybe just modifying the error message in predict.sarlm would be useful ?
If the issue is considered useful and the behavior can not be reproduced, I'll try and provide a small data set to reproduce it.