pysal / spreg

Spatial econometric regression in Python
https://pysal.org/spreg/
Other
67 stars 23 forks source link

Inconsistent results with pysal.spreg.ML_Error #81

Closed eroubenoff closed 3 years ago

eroubenoff commented 3 years ago

I am running a series of spatial regression models on raster data and am getting inconsistent results. I have previously worked with R's spatialreg package which I am using as a comparison. In summary, the estimated coefficients are different and the residuals of pysal.ML_Error appear to be substantially autocorrelated. Is there another argument or control necessary to get the expected behavior from pysal?

I am working with a simulated dataset with two categorical treatment effects and a continuous outcome. The treatments are wide and uniform with mean difference of 5 each. A spatial random field is also added. The shapefile for replication is attached below. Results here are for the full matrix ML models, but similar behavior is observed for Ord eigendecomposition and LU decomposition

First, there are substantial differences in estimated coefficient standard errors:

pysal.ML_Error output:

SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :           Y                Number of Observations:        6875
Mean dependent var  :    137.0773                Number of Variables   :           3
S.D. dependent var  :     27.0418                Degrees of Freedom    :        6872
Pseudo R-squared    :      0.0377
Sigma-square ML     :     218.651                Log likelihood        :  -28813.082
S.E of regression   :      14.787                Akaike info criterion :   57632.164
                                                 Schwarz criterion     :   57652.671

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT     134.0184532       1.4764356      90.7716197       0.0000000
        X1_Treatment       6.5050001       1.7950174       3.6239203       0.0002902
        X2_Treatment      -1.9630525       1.8537134      -1.0589838       0.2896072
              lambda       0.1105057       0.0009583     115.3100728       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

spatialreg::errorsarlm output:

Call:errorsarlm(formula = f, data = gdf, listw = lw)

Residuals:
      Min        1Q    Median        3Q       Max 
-51.72530  -9.93936   0.15471  10.08362  56.74445 

Type: error 
Coefficients: (asymptotic standard errors) 
                 Estimate Std. Error z value  Pr(>|z|)
(Intercept)      134.1807     1.6696 80.3693 < 2.2e-16
X1_TreatmentTRUE   7.2093     1.8529  3.8908  9.99e-05
X2_TreatmentTRUE  -2.5720     1.9584 -1.3133    0.1891

Lambda: 0.87061, LR test value: 7071.1, p-value: < 2.22e-16
Asymptotic standard error: 0.0076531
    z-value: 113.76, p-value: < 2.22e-16
Wald statistic: 12941, p-value: < 2.22e-16

Log likelihood: -28756.28 for error model
ML residual variance (sigma squared): 214.73, (sigma: 14.654)
Number of observations: 6875 
Number of parameters estimated: 5 
AIC: 57523, (AIC for lm: 64592)

Additionally, the residual map from pysal.spreg (left) appears to remove almost no autocorrelation from the data. Contrast this with the algorithmically identical R call (right). My apologies that they are not on the same color scale.

Replication code:

Shapefile of simulated data: sim_data.zip

pysal.model.spreg version: 1.1.0 python version: 3.7

This is the python script I am using:

from pysal.model import spreg
from pysal.lib import weights
from esda import Moran
import geopandas as gpd
import matplotlib.pyplot as plt

# Load data
gdf = gpd.read_file("sim_data.shp")

# Create Design Matrix
X1 = list(gdf['X1'].unique())
X1.remove('NTC')
X2 = list(gdf['X2'].unique())
X2.remove('100pctN')
X = X1 + X2

for x1 in X1:
    gdf[x1] = gdf["X1"] == x1
for x2 in X2:
    gdf[x2] = gdf["X2"] == x2

ols_model = spreg.OLS(
    gdf[['Y']].values,
    gdf[X].values,
    name_y = "Y",
    name_x = X
)

print(ols_model.summary)

w = weights.Queen.from_dataframe(gdf)
error_model = spreg.ML_Error(
    gdf[['Y']].values,
    gdf[X].values,
    w = w,
    name_y = 'Y',
    name_x = X)

print(error_model.summary)

fig, ax = plt.subplots(1)
gdf.plot(column = error_model.u.reshape(1,-1)[0], legend = True)
plt.savefig("pysal_err_resid.png")

And the R equivalent:

library(tidyverse)
library(sf)
library(spatialreg)
library(tmap)
library(tmap)

# Load data
gdf = st_read("sim_data.shp")

# Create Design Matrix
X1 <- unique(gdf$X1)
X1 <- X1[X1 != "NTC"]
X2 <- unique(gdf$X2)
X2 <- X2[X2 != "100pctN"]

X <- c(X1, X2)

for (x1 in X1){
    gdf[,x1] <- gdf$X1 == x1
}
for (x2 in X2){
    gdf[,x2] <- gdf$X2 == x2
}

nb <- spdep::poly2nb(gdf)
lw <- spdep::nb2listw(nb)

f <- formula(paste0("Y~", paste(X, collapse = "+")))

ols_model = lm(f, gdf)
print(summary(ols_model))

error_model <- errorsarlm(f, gdf, listw=lw)
print(summary(error_model))

gdf$residuals <- error_model$residuals

tm <- tm_shape(gdf) + tm_fill(col = "residuals", style= "cont") 
tmap_save(tm, file = "R_spatialreg_resid.png")
pedrovma commented 3 years ago

Hi @eroubenoff ,

Thank you for providing sample data and code, it does make it much easier from this side.

There are 2 different issues going on here (both brought to my attention by @lanselin , thank you!).

1) Regarding the different regression results, PySAL does not row-standardise the weights matrix by default, whereas R does. So you are actually estimating models with different weights matrices: binary in PySAL, row-standardised in R. You need to explicitly row-standardise your weights matrix in PySAL.

2) When plotting the residuals, you are using u, which are 'raw' residuals that, as seen in your model's results, are indeed spatially dependent. That's why you see the spatial dependence in the plot, as would be expected from spatially correlated residuals. If you want, you can plot the spatially filtered residuals instead, which are stored in the attribute e_filtered, as shown in the function documentation.

Additionally, if you use large data sets, keep in mind that GeoDa has better performance in creating weights matrices and estimating ML models than PySAL and R.

I've adapted your code to fix these issues and the results were as expected. Please see below.

I hope this helps!

import spreg
from libpysal import weights
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("sim_data.shp")
gdf = pd.get_dummies(gdf, columns=['X1','X2'])
y = gdf[['Y']]
x = gdf[['X1_X1_Treatment','X2_X2_Treatment']]

w = weights.Queen.from_dataframe(gdf)
w.transform = 'r' # Row-standardising the weights matrix. This is the line you were missing.
error_model = spreg.ML_Error(y.to_numpy(), x.to_numpy(), w=w, name_y=y.columns[0], name_x=list(x.columns))
print(error_model.summary)

#Plot error_model.e_filtered instead of error_model.u:
gdf.plot(column = error_model.e_filtered.reshape(1,-1)[0], legend = True) 
image
REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :           Y                Number of Observations:        6875
Mean dependent var  :    137.0773                Number of Variables   :           3
S.D. dependent var  :     27.0418                Degrees of Freedom    :        6872
Pseudo R-squared    :      0.0378
Sigma-square ML     :     214.730                Log likelihood        :  -28756.276
S.E of regression   :      14.654                Akaike info criterion :   57518.553
                                                 Schwarz criterion     :   57539.060

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT     134.1806633       1.6695505      80.3693333       0.0000000
     X1_X1_Treatment       7.2093079       1.8528979       3.8908285       0.0000999
     X2_X2_Treatment      -2.5719855       1.9584170      -1.3132982       0.1890825
              lambda       0.8706089       0.0076531     113.7592507       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
eroubenoff commented 3 years ago

Thank you very much @pedrovma! (and @lanselin -- I'm a big fan!). This resolves my issue and I appreciate the prompt response.