Closed anjelinejeline closed 7 months ago
Please provide a minimal reproducible example of the problems when using predict, best with code using a built-in data set. predict for new data requires (as you see from the man page) that the row.names of the spatial weights and of the new data object agree. Please refer to https://doi.org/10.1080/17421772.2017.1300679 for details: the data and the weights must be aligned properly. See also https://r-spatial.org/book/17-Econometrics.html#sec-spateconpred for another example.
Much worse is trying to use CV with spatial data. Because of spatial dependence, information leaks from observation to observation, so the CV permutations are not protected against contagion. For CV to work as expected, the observations cannot leak. Despite a good deal of usage by consultants hyping ML, random permutation when spatial dependence between observations may be present is very unsafe and not adequately studied. Model outcomes and explanations will be biased to an unknown degree by unmodelled spatial processes. See also papers on ML/CV challenges:
2012 https://ieeexplore.ieee.org/document/6352393 2018 https://doi.org/10.1016%2Fj.envsoft.2017.12.001 2019 https://doi.org/10.1111%2F2041-210X.13107 2019 https://doi.org/10.1016%2Fj.ecolmodel.2019.108815 2021 https://doi.org/10.1111%2F2041-210X.13650 2022 https://doi.org/10.1111%2F2041-210X.13851 2022 https://doi.org/10.1038%2Fs41467-022-29838-9 2023 https://doi.org/10.5194%2Fegusphere-2023-1308
I would never use R^2 under any circumstances. In this case it is a log-likelihood based Nagelkerke measure, but nobody knows how it performs. It is not appropriate to use y^hat, I think, except where the model is just a regular linear model with no spatial components and no other mis-specification problems. Likelihood ratio tests between nested models should be OK, but comparing non-nested models is not easy, maybe comparison of AIC or BIC/DIC.
It would help if you had linked this issue to https://github.com/tidymodels/spatialsample/issues/157 if they are linked - I'm assuming that they are given the closeness in time between their creation. I'd also prefer a good deal more context about your project, most helpfully in a minimal reproducible example.
@rsbivand thank you for your reply and all explanation :)
I am trying to explore if I can use these models and I still have to learn all the details and requirements behind them..
Thank you for your support
Please provide a minimal reproducible example. Hand-waving (just concepts) does not work.
Okay, I cannot provide you with the entire example, but I will try to make my questions clearer. Please note that these are purely theoretical questions regarding the model building process.
Background on the model building process
I started with building a classical OLS model with the whole dataset, and selected the model structure in terms of covariates based on a stepwise selection considering the AIC. The model I kept working on was the one with the lowest AIC, which I called bestAIC_or.
On this model I checked the assumptions on the residuals (normality and constant variance) and moved to the spatial regression models, given that my residuals were not following a normal distribution and that heteroskedasticity was present. Also, I plotted them and could detect spatial autocorrelation.
Before building a spatial regression I looked into the Moran I for different distances
ggplot(moran_I, aes(x = distance, y = moran)) +
geom_point() +
geom_line()
Based on Moran's I, I identified the groups of neighbours up to a distance of 50 km and built the neighbour weights matrix using the function nb2listw.
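The weights construction was along these lines (a sketch; object names follow those used later in the thread):

```r
library(spdep)
# Neighbours within 50 km of each point
out_arid.nb <- dnearneigh(out_arid_sf, d1 = 0, d2 = 50000)
# Row-standardised spatial weights; zero.policy allows points with no neighbours
out_arid.lw <- nb2listw(out_arid.nb, style = "W", zero.policy = TRUE)
```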
With this matrix I then performed the Lagrange multiplier diagnostics for spatial dependence to see whether it makes sense to build a spatial error model and a spatial lag model. Both tests were significant, but the spatial error model had the higher value (LMErr = 852529, df = 1, p-value < 2.2e-16; LMlag = 504590, df = 1, p-value < 2.2e-16). Thus I built both models with the whole dataset and checked the AIC.
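The LM diagnostics reported above would have come from something like this sketch (lm.LMtests, renamed lm.RStests in recent spdep releases):

```r
# Sketch: Lagrange multiplier diagnostics on the OLS fit underlying bestAIC_or
ols_fit <- lm(bestAIC_or, data = out_arid_sf)
spdep::lm.LMtests(ols_fit, listw = list_spatial_lm$out_arid.lw,
                  test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
```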
# Fit spatial lag model
out_arid_lag <- spatialreg::lagsarlm(formula = bestAIC_or, data = out_arid_sf,
                                     listw = list_spatial_lm$out_arid.lw,
                                     zero.policy = TRUE)
# Fit spatial error model
out_arid_err <- spatialreg::errorsarlm(formula = bestAIC_or, data = out_arid_sf,
                                       listw = list_spatial_lm$out_arid.lw,
                                       zero.policy = TRUE)
list_models_arid <- list(out_arid_lag = out_arid_lag,
                         out_arid_err = out_arid_err)
AIC(list_models_arid$out_arid_lag, list_models_arid$out_arid_err)
Given that the spatial error model had the lower AIC, I kept it and compared it with the Spatial Durbin Error Model (a nested model), following your suggestion.
# Build the Spatial Durbin Error model
out_arid_sd_err <- spatialreg::errorsarlm(formula = bestAIC_or, data = out_arid_sf,
                                          listw = list_spatial_lm$out_arid.lw,
                                          zero.policy = TRUE, etype = "emixed")
And compared it with the spatial error model using the likelihood ratio test. H0: restrict the model to the spatial error model; H1: use the Spatial Durbin Error model.
LR.Sarlm(out_arid_sd_err,list_models_arid$out_arid_err)
Based on these results it seems I should keep the spatial error model, and I am quite happy with that, but I would like to validate it, and here come my doubts about:
1. Diagnostics
I extracted the residuals from the spatial error model to check for normality
qqnorm(list_models_arid$out_arid_err$residuals)
qqline((list_models_arid$out_arid_err$residuals),col="red")
nortest::ad.test(list_models_arid$out_arid_err$residuals) #The test rejects the hypothesis of normality when the p-value is less than or equal to 0.05.
ggplot() +
geom_histogram(mapping = aes(x=list_models_arid$out_arid_err$residuals)) +
xlab("Spatial error model residuals")
and heteroskedasticity
plot(list_models_arid$out_arid_err$residuals)
and they clearly do not follow a normal distribution and the variance is not constant.
So here comes my first question: do the assumptions of standard OLS hold for a spatial regression model? I looked into different websites providing examples of spatial regressions and no one checks the assumptions on the residuals (including the book you pointed me to). I also found claims that these assumptions do not hold for spatial regressions.
E.g. https://www.emilyburchfield.org/courses/gsa/spatial_regression_lab
so I guess that I should not have worried about the residuals (please confirm that)
Create spatial clusters based on the points' spatial coordinates using the k-means clustering algorithm.
cluster_folds=spatial_clustering_cv(out_arid_sf, v = 2)
autoplot(cluster_folds)
Transform the spatial clustering object into a data frame to apply the function
nfolds=nrow(cluster_folds) # Retrieve the number of spatial clusters
spatial_cv_dataset=map(1:nfolds, \(x) get_rsplit(cluster_folds,index=x)|> assessment() |>
mutate(fold = x)) |> list_rbind()
head(spatial_cv_dataset)
tail(spatial_cv_dataset)
Define the metric to compute. RMSE is the square root of the mean of the squared errors, where the error is the difference between the fitted (f) and observed values.
Define the function for training and testing the model. The objects that should be passed are the fold number k, the sf object containing the data, d2 the upper distance bound to identify neighbouring points, the listw object created on the WHOLE dataset, and y the outcome variable in the sf object.
RMSE <- function(observed, predicted){
  rmse <- mean((observed - predicted)^2) |> sqrt()
  return(rmse)
}
spatial_cv_fun <- function(k, sf, d2, lw, y){
  # Define the training and testing datasets
  training <- filter(sf, fold != k)
  testing <- filter(sf, fold == k)
  # Build the neighbouring matrix on the training dataset
  training.nb <- spdep::dnearneigh(training, d1 = 0, d2 = d2)
  training.lw <- nb2listw(training.nb, style = "W", zero.policy = TRUE)
  model_training <- spatialreg::errorsarlm(formula = bestAIC_or, # Provide the model formula
                                           data = training,
                                           listw = training.lw, zero.policy = TRUE)
  fitted <- predict(model_training, newdata = testing, listw = lw)[, 1] # first column of the predict object contains the fitted values
  observed <- testing |> dplyr::select(all_of(y)) |> st_drop_geometry() |> pull() # extract the observed values
  rmse <- RMSE(observed = observed, predicted = fitted)
  return(rmse)
}
Try out the function on one fold
# Convert into an sf object given that this info was lost when creating the spatial folds
spatial_cv_dataset=st_as_sf(spatial_cv_dataset)
spatial_cv_fun(k = 2, # Index of the fold held out for testing
               sf = spatial_cv_dataset, # dataset
               d2 = 50000, # Upper distance of neighbouring points
               lw = list_spatial_lm$out_arid.lw, # lw object built on the entire dataset
               y = "focal_sum_ghe" # outcome variable
)
This is just an example, but I want to create 10 folds and compute the RMSE for every fold k.
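A loop over all folds might look like this sketch, reusing the function defined above:

```r
# Sketch: compute the RMSE for every fold with spatial_cv_fun()
nfolds <- max(spatial_cv_dataset$fold)
rmse_per_fold <- purrr::map_dbl(seq_len(nfolds),
  \(k) spatial_cv_fun(k = k, sf = spatial_cv_dataset, d2 = 50000,
                      lw = list_spatial_lm$out_arid.lw, y = "focal_sum_ghe"))
mean(rmse_per_fold)
```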
What do you think about RMSE as a metric to validate the spatial error model?
Thank you for this discussion Angela
What are your data, the count of observations and covariates? What is the support of the data, polygon or point? And what are you trying to model? In most cases, the relative fit of a model is less important than the coefficient estimates and the standard errors of those coefficients. One would effectively never try to cherry-pick independent variables: if the underlying model specifies them, they should remain. Prediction is not even a minor goal. In a SEM, the spatial dependency does not enter the prediction. No standard errors or intervals of prediction are available; see the article.
Hi @rsbivand, thanks for the rapid response. All variables and the outcome were normalized between 0 and 1 (I used the min-max normalization method) .. the data are spatial points.
With regard to your answer, I am really sorry but I am afraid I do not follow you. Could you please clarify, addressing my two doubts about diagnostics (whether the residual assumptions should be checked) and validation (using the Nagelkerke pseudo-R-squared for the variance explained, and computing the RMSE with a spatial cross-validation approach)?
Additionally I think that I am even more confused now :/
What do you mean by "In a SEM, the spatial dependency does not enter the prediction"?
Most of the purpose of spatial econometrics as originally conceived was to correct the standard error estimates of regression coefficients, which are biased when spatial autocorrelation is present but not modelled. Prediction was not discussed in any articles at all pre-2000, and only studied in any depth in https://doi.org/10.1080/17421772.2017.1300679. Haining (1990) does touch briefly on prediction, and also on diagnostics (influence plots, outliers), but there are no implementations in his work. Without software, no applications. The general practice has been to fit either by GMM/STSLS or by ML/Bayesian methods. The LM tests came as a supplement to Moran's I for regression residuals, and all of these tests respond to any mis-specification, not just spatial autocorrelation.
I inquired in https://doi.org/10.1007/s101090300096 about how prediction might be fashioned. The only trace that followed some time later was the realisation that impacts in the spatial lag and spatial Durbin models are not the same as in OLS, that is that the change in y_i from a unit change in x_ij is not \beta_j but a function of \beta_j and the autoregressive coefficient \rho.
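For lag and Durbin models these impacts can be obtained with spatialreg::impacts; a sketch using the objects from this thread (for very large n one would pass traces computed with trW rather than the listw itself):

```r
# Sketch: direct, indirect and total impacts for the spatial lag model;
# not needed for the SEM, where dy/dx_j is simply beta_j
spatialreg::impacts(out_arid_lag, listw = list_spatial_lm$out_arid.lw)
```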
In the spatial error model, there is no such feedback, and dy/dx_j is \beta_j. If \beta_j differs between OLS and SEM, a Hausman test will show whether the difference is caused by further mis-specification https://doi.org/10.1016/j.econlet.2008.09.003. LeSage & Pace propose that the SDM or SDEM would be better specified than the SEM if the null hypothesis of no difference between the OLS and SEM \beta is rejected.
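The spatial Hausman test mentioned here is implemented for fitted error models; a sketch with the objects from this thread:

```r
# Sketch: test whether OLS and SEM coefficient estimates differ,
# pointing to mis-specification beyond spatial autocorrelation
spatialreg::Hausman.test(list_models_arid$out_arid_err)
```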
It has never been usual practice to try to make fitted models fit better, because almost always the model is fitted to all the data. This may seem unlike other areas, especially of non-spatial statistics. In spatial statistics, only geostatistics has been concerned with prediction, but there not for model fitting as such.
It is possible to predict using spatial econometrics type models using INLA, but for your purposes it is not even clear why using spatial econometrics makes any sense.
The link to an old class is out of date and does not reflect current practice, for moderately current practice see https://journal.srsa.org/ojs/index.php/RRS/article/view/44.1.2. See also https://doi.org/10.1111/jors.12188. Here too one would assume that the data entering the model would satisfy all the assumptions of OLS with the main exception of residual spatial autocorrelation.
I asked how many observations you have, how many variables are included in the model; I understand that the observations have point support and that the points are expressed by decimal degrees and cover the globe (figure above). There seem to be very many points, but you use the default estimation method for ML fitting, which is very likely highly sub-optimal. Without knowing what you are trying to do, I fear that you are using the wrong tool completely. If you have global point support data, and want to find a model that predicts best for moderately large data sets, but have no theoretical model informing your choice of covariates, then spatial econometrics is probably inappropriate. I have no idea what the response is. I thought the data might be real estate prices, but they would not be global.
Please tell me what you are trying to do, and I can see whether I can explain why your expectations of spatial econometrics models are inappropriate. This is not a problem with the software, which behaves as specified, but is a problem with your using spatial error models for purposes for which they were never intended.
Thank you for this comprehensive explanation. Indeed I have a very large dataset. Which method should I try if not ML?
With regard to my data, the original outcome variable is a count variable: the number of events occurring in a location, which I weighted considering the expenditure of the country where the locations are. This weighted variable was then normalized between 0 and 1, along with all covariates. Do you suggest transforming them instead of putting them on the same scale, to meet the assumptions of OLS before even building a spatial regression?
Why do you think that the model may not be appropriate?
The goal is to find a relationship between the covariates and the outcome .. it's not to predict in the strict sense on a new scenario, but I wanted to use the predict function to validate the model and use the fitted values on the testing fold to calculate the RMSE. Are you saying that the fitted values returned by the predict function may be wrong if a spatial error model is passed?
It is still not clear to me whether I can use the Nagelkerke pseudo-R-squared for the variance explained.
Thank you for your feedback
ML or Bayesian may be OK if you choose a different method= argument value than the default; ML or Bayesian permit the use of goodness of fit comparisons like likelihood ratio, GMM/STSLS do not.
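A sketch of choosing a sparse-matrix method for large n (the method values are documented in ?spatialreg::errorsarlm):

```r
# Sketch: sparse Cholesky ("Matrix") or "LU" decomposition instead of the
# default dense "eigen" method, which does not scale to very large n
fit_err <- spatialreg::errorsarlm(bestAIC_or, data = out_arid_sf,
                                  listw = list_spatial_lm$out_arid.lw,
                                  method = "Matrix", zero.policy = TRUE)
```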
If your data are counts, rather use Poisson or negative binomial. I suspect that your "weighting" is actually dividing by some quantity, which in GLM terms would be an offset. If you are an economist and are looking at elasticities (you probably are not), the log link in a GLM would be something like that. I never rescale any covariates (or the response), because then the coefficients cannot be interpreted in measured units.
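In GLM terms the weighting-by-expenditure would look roughly like this (variable names are hypothetical):

```r
# Sketch: keep the raw counts and move the divisor into an offset
fit_pois <- glm(n_events ~ x1 + x2 + offset(log(expenditure)),
                family = poisson, data = dat)
# Negative binomial if the counts are overdispersed
fit_nb <- MASS::glm.nb(n_events ~ x1 + x2 + offset(log(expenditure)),
                       data = dat)
```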
If you are only interested in accounting for any residual autocorrelation in the model, use for example mgcv::gam with an "mrf" smooth; this also permits the proper use of count variables. You do not need the specific SEM model you chose; rather you need a more standard model permitting an additive spatial smooth, as far as I can tell from the very vague description of your research question and of what the data actually represent. If you are an economist and are concerned that reviewers in narrow economics journals will require the use of SEM, try a different journal.
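A sketch of the suggested mgcv approach (names are hypothetical; the "mrf" smooth requires a factor whose levels match the names of a neighbour list):

```r
library(mgcv)
# Sketch: count response, expenditure offset, and an additive
# Markov random field spatial smooth over areal units
# 'region' is a factor; 'nb' a named neighbour list over its levels
fit_gam <- gam(n_events ~ x1 + x2 + offset(log(expenditure)) +
                 s(region, bs = "mrf", xt = list(nb = nb)),
               family = poisson, data = dat)
```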
If predictions are not important, then the only thing that matters is the precision of the coefficient estimates. The precision (the size of the standard error estimates) is affected by unmodelled residual spatial autocorrelation, which biases them downwards. Briefly, if there are n observations which are independent, (n-k), or for ML (n-1), is a reasonable number by which to divide the sum of squared errors. If the residuals show positive spatial autocorrelation, there are not n independent observations but rather n* << n, because they are dependent on each other. The loss of effective degrees of freedom pulls the rug out from under any RMSE or similar measure, probably also with spatial folds. There is no helpful literature of which I am aware showing that CV is of any use in areal spatial data analysis. If you need to compare models, Bayesian model averaging is at least represented in the literature. The term "validating the model" really does not apply as such when you are fitting models to observational data, which is very unlike random samples.
No, the appropriate measure is the likelihood ratio test between nested spatial econometrics models if fitted with ML. The Nagelkerke measure is 1 - exp(-(2/n)*(logLik(obj) - nullLL)), where obj is the fitted model and nullLL is the log likelihood of an OLS model with the same response and only the intercept (all we know about the response is its mean); so it is a function of the log likelihoods. The log likelihood of the SEM takes account of the presence of spatial autocorrelation as well as being a function of the sum of squared errors; for OLS it is only a function of the sum of squared errors. A likelihood ratio test compares the goodness of fit of nested models; the pseudo-R-squared is just a number.
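Computed by hand, the measure is roughly this sketch, using the objects from earlier in the thread:

```r
# Sketch: Nagelkerke pseudo-R^2 for the fitted SEM
obj <- list_models_arid$out_arid_err
nullLL <- logLik(lm(focal_sum_ghe ~ 1, data = out_arid_sf))  # intercept-only OLS
n <- length(residuals(obj))
1 - exp(-(2 / n) * (as.numeric(logLik(obj)) - as.numeric(nullLL)))
```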
Thank you for your rapid response. If I go for a GAM model and use the offset option, would you rescale the outcome variable and the offset variable between 0 and 1, or would it be better to rescale just the covariates?
I never rescale any variables; there is no reasonable motivation other than possibly avoiding numerical problems. If the natural scales of covariates or response lead to numerically small or large coefficients, I multiply or divide by a power of 10 (say, convert metres to kilometres). That keeps the coefficient values numerically stable and directly interpretable.
Thanks for your response and for sharing your thoughts on this .. Can a spatial negative binomial regression be built with spatialreg? I am not sure whether it is supported .. it seems that it supports only lm ... I would rather prefer a classical statistical approach to a GAM.
BTW these articles predict using spatial econometrics models
https://www.nature.com/articles/s41598-017-08254-w https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0282524
and also re-reading the book you pointed me too they actually used the predict function ... https://r-spatial.org/book/17-Econometrics.html#sec-spateconpred
I apologize but it's still not clear to me why it is not possible to predict using these models :/ :(
You can fit Poisson and negbin models using many of the disease mapping packages (unsure about negbin). Most are Bayesian (CARBayes, etc.), but hglm can also be used. The big difference in spatial econometrics models (from Ord 1975, via Anselin 1988, to LeSage & Pace 2009 and onwards) is that SE models do not admit a spatially structured random effect in addition to the residuals. SE starts from y = Xb + e; disease mapping adds a spatially structured random effect u: y = Xb + u + e. In practice there isn't much difference, but the two traditions separated in the 1970s (Ord or Besag). You can see this in Dumelle et al.'s equation 2.
spatialreg provides spatial econometrics-type model fitting functions, not because they are better, but because that is what was needed for teaching spatial econometrics. Lots is easier with the disease mapping-type models.
spmodel is a new package, and so far I haven't checked what is comparable with spatialreg (probably only SEM). I don't see non-Gaussian responses.
Yes, spatialreg::predict works correctly, but the row.names of the listw object (in the region.id attribute), of the fitting data, and of the new data must agree, in order to find and use the feed-through from data to new-data observations. That is why your problem is only practical, and why I asked for a complete minimal reproducible example.
Thank you for your rapid response 😀
For the predict issue, it is now clearer and I was able to use it in the cross-validation and compute the RMSE as I posted above.
One last doubt remains with regard to the pseudo R-squared .. why do you say that it's just a number? How can it be interpreted? If it's in the summary, there's a reason .. also in glm I use a pseudo R-squared and interpret it as a measure of improvement over the null model .. so I am confused that you said I cannot use it as a metric for the goodness of fit.
Yes, the pseudo R squared is a measure of improvement over the null model, but its scaling probably isn't comparable with the linear model coefficient of determination. And comparing across models isn't obvious.
Got it .. Many thanks for this interesting discussion :)
Hi I am looking into a way to validate a spatial error model I built.
On the one hand I would like to perform a cross-validation, but I can't predict on a dataset with different row ids using the predict function .. and I guess it is because of the neighbourhood matrix.
Is there a way to validate the model on a different dataset?
Also, is the R-squared that is computed a good measure of model fit in the case of a spatial error model? Or do you suggest using other measures?
Thank you Angela