r-spatial / spatialreg

spatialreg: spatial models estimation and testing
https://r-spatial.github.io/spatialreg/

Goodness of fit of a spatial error model #45

Closed anjelinejeline closed 7 months ago

anjelinejeline commented 8 months ago

Hi, I am looking for a way to validate a spatial error model I built.

On the one hand I would like to perform cross-validation, but I can't predict on a dataset with different row ids using the predict function, and I guess this is because of the neighbourhood matrix.

Is there a way to validate the model on a different dataset?

Also, is the reported R-squared a good measure of model fit in the case of a spatial error model? Or do you suggest using other measures?

Thank you Angela

rsbivand commented 8 months ago

Please provide a minimal reproducible example of problems when using predict - best code and using a built-in data set. predict for new data requires (as you see from the man page) the row.names of the spatial weights and the new data object to agree. Please refer to https://doi.org/10.1080/17421772.2017.1300679 for details - the data and the weights must be aligned properly, see also https://r-spatial.org/book/17-Econometrics.html#sec-spateconpred for another example.

Much worse is trying to use CV with spatial data. Because of spatial dependence, information leaks from observation to observation, so the CV permutations are not protected against contagion. For CV to work as expected, the observations cannot leak. Despite a good deal of usage by consultants hyping ML, random permutation when spatial dependence between observations may be present is very unsafe and not adequately studied. Model outcomes and explanations will be biased to an unknown degree by unmodelled spatial processes. See also papers on ML/CV challenges:

2012 https://ieeexplore.ieee.org/document/6352393
2018 https://doi.org/10.1016/j.envsoft.2017.12.001
2019 https://doi.org/10.1111/2041-210X.13107
2019 https://doi.org/10.1016/j.ecolmodel.2019.108815
2021 https://doi.org/10.1111/2041-210X.13650
2022 https://doi.org/10.1111/2041-210X.13851
2022 https://doi.org/10.1038/s41467-022-29838-9
2023 https://doi.org/10.5194/egusphere-2023-1308

I would never use R^2 under any circumstances. In this case it is a log-likelihood based Nagelkerke measure, but nobody knows how it performs. It is not appropriate to use y-hat, I think, except where the model is just a regular linear model with no spatial components and no other mis-specification problems. Likelihood ratio tests between nested models should be OK, but comparing non-nested models is not easy; maybe comparison of AIC or BIC/DIC.

rsbivand commented 8 months ago

It would help if you had linked this issue to https://github.com/tidymodels/spatialsample/issues/157 if they are linked - I'm assuming that they are given the closeness in time between their creation. I'd also prefer a good deal more context about your project, most helpfully in a minimal reproducible example.

anjelinejeline commented 8 months ago

@rsbivand thank you for your reply and all explanation :)

I am trying to explore if I can use these models and I still have to learn all the details and requirements behind them..

Thank you for your support

rsbivand commented 8 months ago

Please provide a minimal reproducible example. Hand-waving (just concepts) does not work.

anjelinejeline commented 8 months ago

Okay, I cannot provide you with the entire example, but I will try to make my questions clearer. Please note that these are purely theoretical questions regarding the model building process.

Background on the model building process

I started by building a classical OLS model with the whole dataset, and selected the model structure in terms of covariates based on stepwise selection using the AIC. The model I kept working on was the one with the lowest AIC, which I called bestAIC_or. On this model I checked the assumptions on the residuals (normality and constant variance) and moved to spatial regression models, given that my residuals did not follow a normal distribution and heteroskedasticity was present. Also, when I plotted them I could detect spatial autocorrelation.

Before building a spatial regression I looked into Moran's I for different distances:

ggplot(moran_I, aes(x = distance, y = moran)) + 
  geom_point() +
  geom_line()


Based on Moran's I, I identified the groups of neighbours up to a distance of 50 km and built the spatial weights using the function nb2listw.
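A minimal sketch of that step, assuming an sf point object `pts` in a metric CRS (the object name is illustrative, not the project code):

```r
# Illustrative sketch: distance-band neighbours up to 50 km for an sf
# point object `pts` whose coordinates are in metres.
library(spdep)

pts.nb <- dnearneigh(pts, d1 = 0, d2 = 50000)  # neighbour pairs within 50 km
pts.lw <- nb2listw(pts.nb, style = "W",        # row-standardised weights
                   zero.policy = TRUE)         # tolerate no-neighbour points
```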

With this weights object I then performed the Lagrange multiplier diagnostics for spatial dependence, to see whether building a spatial error model or a spatial lag model makes sense. Both tests were significant, but the spatial error model had the higher statistic (LMErr = 852529, df = 1, p-value < 0.00000000000000022; LMlag = 504590, df = 1, p-value < 0.00000000000000022). Thus I built both models with the whole dataset and checked the AIC.

# Fit the spatial lag model
out_arid_lag = spatialreg::lagsarlm(formula = bestAIC_or, data = out_arid_sf,
                                    listw = list_spatial_lm$out_arid.lw, zero.policy = TRUE)

# Fit the spatial error model
out_arid_err = spatialreg::errorsarlm(formula = bestAIC_or, data = out_arid_sf,
                                      listw = list_spatial_lm$out_arid.lw, zero.policy = TRUE)

list_models_arid = list(out_arid_lag = out_arid_lag,
                        out_arid_err = out_arid_err)

AIC(list_models_arid$out_arid_lag, list_models_arid$out_arid_err)


Given that the spatial error model had the lowest AIC, I kept it and compared it with the spatial Durbin error model (a nested model), following your suggestion.

# Build the spatial Durbin error model
out_arid_sd_err = spatialreg::errorsarlm(formula = bestAIC_or, data = out_arid_sf,
                                         listw = list_spatial_lm$out_arid.lw, zero.policy = TRUE,
                                         etype = "emixed")

and compared it with the spatial error model using a likelihood ratio test (H0: restrict the model to the spatial error model; H1: use the spatial Durbin error model).

LR.Sarlm(out_arid_sd_err,list_models_arid$out_arid_err)


Based on these results it seems I should keep the spatial error model, and I am quite happy with that, but I would like to validate it, and here come my doubts.

1. Diagnostics

I extracted the residuals from the spatial error model to check for normality

qqnorm(list_models_arid$out_arid_err$residuals)
qqline(list_models_arid$out_arid_err$residuals, col = "red")
nortest::ad.test(list_models_arid$out_arid_err$residuals) # The test rejects the hypothesis of normality when the p-value is <= 0.05


ggplot() + 
  geom_histogram(mapping = aes(x=list_models_arid$out_arid_err$residuals)) +
  xlab("Spatial error model residuals")

and heteroskedasticity

plot(list_models_arid$out_arid_err$residuals)

The residuals clearly do not follow a normal distribution, and the variance is not constant.

So here comes my first question: do the assumptions of standard OLS hold for a spatial regression model? I looked into different websites providing examples of spatial regressions and nobody checks the assumptions on the residuals (including the book you pointed me to). I also found claims that these assumptions do not hold for spatial regressions.

E.g. https://www.emilyburchfield.org/courses/gsa/spatial_regression_lab

so I guess that I should not have worried about the residuals (please confirm).

2. Validation

From what you said above I cannot use the classical R-squared, but can I use the Nagelkerke pseudo-R-squared to assess my model? I would look at this metric for the model built on the entire dataset to get an idea of the variance explained by the model. In my case the summary returns a Nagelkerke pseudo-R-squared of 0.92675, which looks very good to me. Please confirm the interpretation of this metric. To complete my validation I performed a spatial cross-validation, building a function by hand (below, an example with 2 folds).

Cross validation and RMSE

Create spatial clusters based on the points' spatial coordinates using the k-means clustering algorithm (here with two folds).


cluster_folds=spatial_clustering_cv(out_arid_sf, v = 2)
autoplot(cluster_folds)


Transform the spatial clustering object into a data frame to apply the function:

nfolds=nrow(cluster_folds)   # Retrieve the number of spatial clusters 

spatial_cv_dataset=map(1:nfolds, \(x) get_rsplit(cluster_folds,index=x)|> assessment() |> 
                         mutate(fold = x)) |> list_rbind()

head(spatial_cv_dataset)
tail(spatial_cv_dataset)

RMSE=function(observed, predicted){
  rmse=mean((observed - predicted)^2) |> sqrt()
  return(rmse)
}
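A quick sanity check of the RMSE helper (restated here so the snippet is self-contained; a constant error of 2 should give an RMSE of exactly 2):

```r
# Root mean squared error: square root of the mean squared difference
RMSE <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

RMSE(observed = c(1, 2, 3), predicted = c(3, 4, 5))  # 2
```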

spatial_cv_fun = function(k, sf, d2, lw, y){

  # Define the training and testing datasets
  training = filter(sf, fold != k)
  testing = filter(sf, fold == k)

  # Build the neighbouring matrix on the training dataset
  training.nb = spdep::dnearneigh(training, d1 = 0, d2 = d2)
  training.lw = nb2listw(training.nb, style = "W", zero.policy = TRUE)

  model_training = spatialreg::errorsarlm(formula = bestAIC_or, # Provide the model formula
                                          data = training,
                                          training.lw, zero.policy = TRUE)

  fitted = predict(model_training, newdata = testing, listw = lw)[,1] # first column of the predict object contains the fitted values
  observed = testing |> dplyr::select(y) |> st_drop_geometry() |> pull() # extract the observed values

  rmse = RMSE(observed = observed, predicted = fitted)
  return(rmse)

}

Try out the function on 1 fold


# Convert into an sf object, given that this info was lost when creating the spatial folds

spatial_cv_dataset = st_as_sf(spatial_cv_dataset)

spatial_cv_fun(k = 2, # Fold used as the test set
               sf = spatial_cv_dataset, # dataset
               d2 = 50000, # Upper distance of neighbouring points
               lw = list_spatial_lm$out_arid.lw, # lw object built on the entire dataset
               y = "focal_sum_ghe" # outcome variable
               )


This is just an example, but I want to create 10 folds and compute the RMSE for every k.

What do you think about RMSE as a metric to validate the spatial error model?

  3. Another question that just came to my mind: how can I extract the 95% confidence intervals of the predictions? It seems that the predict function does not support the interval option.

Thank you for this discussion Angela

rsbivand commented 8 months ago

What are your data, count of observations and covariates? What is the support of the data, polygon or point? And what are you trying to model? In most cases, the relative fit of a model is less important than the coefficient estimates and the standard errors of those coefficients. One would effectively never try to cherry-pick independent variables. If the underlying model specifies them, they should remain. Prediction is not even a minor goal. In a SEM, the spatial dependency does not enter the prediction. No standard errors or intervals of prediction are available - see the article.

anjelinejeline commented 8 months ago

Hi @rsbivand, thanks for the rapid response. All variables and the outcome were normalized between 0 and 1 (I used the min-max normalization method); the data are spatial points.

With regard to your answer, I am really sorry but I am afraid I do not follow you. Could you please clarify the two doubts about diagnostics (whether the residual assumptions should be checked) and validation (using the Nagelkerke pseudo-R-squared for the variance explained, and computing the RMSE with a spatial cross-validation approach)?

Additionally I think that I am even more confused now :/

What do you mean by "In a SEM, the spatial dependency does not enter the prediction"?

rsbivand commented 8 months ago

Most of the purpose of spatial econometrics as originally conceived was to correct the standard error estimates of regression coefficients, which are biased when spatial autocorrelation is present but not modelled. Prediction was not discussed in any articles at all pre-2000, and only studied in any depth in https://doi.org/10.1080/17421772.2017.1300679. Haining (1990) does touch briefly on prediction, and also on diagnostics (influence plots, outliers), but there are no implementations in his work. Without software, no applications. The general practice has been either to fit using GMM/STSLS or ML/Bayesian methods. The LM tests came as a supplement to Moran's I for regression residuals, and all the tests respond to any mis-specification, not just spatial autocorrelation.

I inquired in https://doi.org/10.1007/s101090300096 about how prediction might be fashioned. The only trace that followed some time later was the realisation that impacts in the spatial lag and spatial Durbin models are not the same as in OLS, that is that the change in y_i from a unit change in x_ij is not \beta_j but a function of \beta_j and the autoregressive coefficient \rho.

In the spatial error model, there is no such feedback, and dy/dx_j is \beta_j. If \beta_j differs between OLS and SEM, a Hausman test will show whether the difference is caused by further mis-specification https://doi.org/10.1016/j.econlet.2008.09.003. LeSage & Pace propose that the SDM or SDEM would be better specified than SEM if the null hypothesis of no difference between OLS and SEM \beta is not accepted.
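The Hausman test referred to is available in spatialreg; a minimal sketch, assuming `sem_fit` is a fitted errorsarlm() object (the object name is illustrative):

```r
# Illustrative: spatial Hausman test (Pace & LeSage 2008), comparing OLS and
# SEM coefficient estimates; a small p-value points to further mis-specification.
library(spatialreg)

spatialreg::Hausman.test(sem_fit)
```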

It has never been usual practice to try to make fitted models fit better, because almost always the model is fitted to all the data. This may seem unlike other areas, especially of non-spatial statistics. In spatial statistics, only geostatistics has been concerned with prediction, but there not for model fitting as such.

It is possible to predict using spatial econometrics type models using INLA, but for your purposes it is not even clear why using spatial econometrics makes any sense.

The link to an old class is out of date and does not reflect current practice, for moderately current practice see https://journal.srsa.org/ojs/index.php/RRS/article/view/44.1.2. See also https://doi.org/10.1111/jors.12188. Here too one would assume that the data entering the model would satisfy all the assumptions of OLS with the main exception of residual spatial autocorrelation.

I asked how many observations you have, how many variables are included in the model; I understand that the observations have point support and that the points are expressed by decimal degrees and cover the globe (figure above). There seem to be very many points, but you use the default estimation method for ML fitting, which is very likely highly sub-optimal. Without knowing what you are trying to do, I fear that you are using the wrong tool completely. If you have global point support data, and want to find a model that predicts best for moderately large data sets, but have no theoretical model informing your choice of covariates, then spatial econometrics is probably inappropriate. I have no idea what the response is. I thought the data might be real estate prices, but they would not be global.

Please tell me what you are trying to do, and I can see whether I can explain why your expectations of spatial econometrics models are inappropriate. This is not a problem with the software, which behaves as specified, but is a problem with your using spatial error models for purposes for which they were never intended.

anjelinejeline commented 8 months ago

Thank you for this comprehensive explanation. Indeed I have a very large dataset. Which method should I try if not ML?

  1. With regard to my data, the original outcome variable is a count: the number of events occurring in a location, which I weighted by the expenditure of the country where the locations are. This weighted variable was then normalized between 0 and 1 along with all covariates. Do you suggest transforming them instead of putting them on the same scale to meet the assumptions of OLS before even building a spatial regression?

  2. Why do you think that the model may not be appropriate?

  3. The goal is to find a relationship between the covariates and the outcome ..it's not to predict in the strict sense on a new scenario but I wanted to use the predict function to validate the model and use the fitted values on the testing k fold to calculate the RMSE. Are you saying that the fitted values returned by the predict function may be wrong if a spatial error model is passed ?

  4. It is still not clear to me whether I can use the Nagelkerke pseudo-R-squared for the variance explained

Thank you for your feedback

rsbivand commented 8 months ago

ML or Bayesian may be OK if you choose a different method= argument value than the default; ML or Bayesian permit the use of goodness of fit comparisons like likelihood ratio, GMM/STSLS do not.

  1. If your data are counts, rather use Poisson or negative binomial. I suspect that your "weighting" is actually dividing by some quantity, which in GLMM terms would be an offset. If you are an economist and are looking at elasticities (you probably are not), the log link in a GLM would be something like that. I never rescale any covariates (or the response) because then the coefficients cannot be interpreted in measured units.

  2. If you are only interested in accounting for any residual autocorrelation in the model, use for example mgcv::gam with an "mrf" smooth; this also permits the proper use of count variables. You do not need the specific SEM model you chose, rather a more standard model permitting an additive spatial smooth, as far as I can tell from the very vague description of your research question and what the data actually represent. If you are an economist and are concerned that reviewers in narrow economics journals will require use of SEM, try a different journal.

  3. If predictions are not important, then the only thing that matters is the precision of the coefficient estimates. The precision (the size of the standard error estimates) is affected by unmodelled residual spatial autocorrelation, which biases it downwards. Briefly, if there are n observations, which are independent, (n-k) or for ML (n-1) is a reasonable number with which to divide the sum of squared errors. If the residuals show positive spatial autocorrelation, there are not n independent observations, but rather n* << n, because they are dependent on each other. The loss of effective degrees of freedom pulls the rug out from under any RMSE or similar, probably also with spatial folds. There is no helpful literature of which I am aware showing that CV is any use in areal spatial data analysis. If you need to compare models, Bayesian model averaging is at least represented in the literature. The term "validating the model" really does not apply as such when you are fitting models to observational data, which is very unlike random samples.

  4. No, the appropriate measure is the likelihood ratio test between nested spatial econometrics models if fitted with ML. The Nagelkerke measure is 1 - exp(-(2/n)*(logLik(obj) - nullLL)) where obj is the fitted model and nullLL is the log likelihood of an OLS model with the same response and only the intercept (all we know about the response is its mean), so a function of the log likelihoods. The log likelihood of SEM takes account of the presence of spatial autocorrelation as well as a function of the sum of squared errors, for OLS, only a function of the sum of squared errors. Using a likelihood ratio test compares the goodness of fit of nested models, pseudo-R-squared is just a number.
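The Nagelkerke measure described above can be written directly as a function of the two log likelihoods; a minimal sketch, taking plain numeric inputs rather than fitted-model objects:

```r
# 1 - exp(-(2/n) * (logLik(obj) - nullLL)): improvement over the
# intercept-only (null) model, expressed via the two log likelihoods.
nagelkerke_r2 <- function(ll_model, ll_null, n) {
  1 - exp(-(2 / n) * (ll_model - ll_null))
}

nagelkerke_r2(ll_model = -100, ll_null = -100, n = 50)  # 0: no improvement over the null
```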

anjelinejeline commented 8 months ago

Thank you for your rapid response. If I go for a GAM model and use the offset option, would you rescale the outcome variable and the offset variable between 0 and 1, or would it be better to rescale just the covariates?

rsbivand commented 8 months ago

I never rescale any variables; there is no reasonable motivation other than possibly avoiding numerical problems. If the natural scales of covariates or response lead to numerically small or large coefficients, I multiply or divide by a power of 10 (say, convert metres to kilometres). That keeps the coefficient values numerically stable and directly interpretable.

anjelinejeline commented 8 months ago

Thanks for your response and for sharing your thoughts on this. Can a spatial negative binomial regression be built with spatialreg? I am not sure whether it is supported; it seems to support only linear models. I would rather prefer a classical statistical approach to a GAM.

anjelinejeline commented 8 months ago

BTW these articles predict using spatial econometrics models

https://www.nature.com/articles/s41598-017-08254-w https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0282524

and also, re-reading the book you pointed me to, they actually use the predict function: https://r-spatial.org/book/17-Econometrics.html#sec-spateconpred

I apologize but it's still not clear to me why it is not possible to predict using these models :/ :(

rsbivand commented 8 months ago

You can fit poisson and negbin models using many of the disease mapping packages (unsure about negbin). Most are Bayesian (CARBayes, etc.), but hglm can also be used. The big difference between spatial econometrics models (from Ord 1975, via Anselin 1988 to LeSage & Pace 2009 and onwards) is that SE models do not admit a spatially structured random effect in addition to the residuals. SE starts from y = X b + e; disease mapping adds a spatially structured random effect (u): y = Xb + u + e. In practice there isn't much difference, but the two traditions separated in the 1970's (Ord or Besag). You can see this in Dumelle et al.'s equation 2.

spatialreg provides spatial econometrics-type model fitting functions, not because they are better, but because that is what was needed for teaching spatial econometrics. Lots is easier with the disease mapping-type models.

spmodel is a new package, and so far I haven't checked what is comparable with spatialreg (probably only SEM). I don't see non-Gaussian responses.

Yes, spatialreg::predict works correctly, but the row.names of the listw object (in the region.id attribute) and the fitting data and the new data must agree, in order to find and use feed-through from data to new-data observations. That is why your problem is only practical, and why I asked for a complete minimal reproducible example.
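The alignment requirement can be checked directly before calling predict; a minimal sketch with illustrative object names:

```r
# Illustrative: the ids stored with the weights must agree with the
# row.names of both the fitting data and the new data.
attr(lw, "region.id")    # ids recorded when the listw object was built
row.names(fitting_data)  # should match the ids above
row.names(new_data)      # and so should the prediction data
```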

anjelinejeline commented 8 months ago

Thank you for your rapid response 😀

For the predict issue, it is now clearer and I was able to use it in the cross-validation and compute the RMSE as I posted above.

One last doubt remains with regard to the pseudo-R-squared: why do you say that it's just a number? How can it be interpreted? If it's in the summary there must be a reason. Also with GLMs I use a pseudo-R-squared and interpret it as a measure of improvement over the null model, so I am confused that you say I cannot use it as a metric for goodness of fit.

rsbivand commented 8 months ago

Yes, the pseudo R squared is a measure of improvement over the null model, but its scaling probably isn't comparable with the linear model coefficient of determination. And comparing across models isn't obvious.

anjelinejeline commented 7 months ago

Got it .. Many thanks for this interesting discussion :)