Closed eroubenoff closed 3 years ago
Hi @eroubenoff ,
Thank you for providing sample data and code, it does make it much easier from this side.
There are 2 different issues going on here (both brought to my attention by @lanselin , thank you!).
1) Regarding the different regression results, PySAL does not row-standardise the weights matrix by default, whereas R does. So you are actually estimating models with different weights matrices: binary in PySAL, row-standardised in R. You need to explicitly row-standardise your weights matrix in PySAL.
2) When plotting the residuals, you are using u, which are 'raw' residuals that, as seen in your model's results, are indeed spatially dependent. That's why you see the spatial dependence in the plot, as would be expected from spatially correlated residuals. If you want, you can plot the spatially filtered residuals instead, which are stored in the attribute e_filtered, as shown in the function documentation.
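To make the two points above concrete, here is a minimal sketch in plain Python that does not use PySAL at all. It shows (1) how a binary weights matrix and its row-standardised version give different spatial lags, and (2) how filtering residuals as u - lambda * W u recovers the uncorrelated innovations, assuming the standard spatial error specification u = lambda * W u + eps. All function names, the 4-cell layout, and the values of lam and eps are hypothetical and chosen only for illustration; the sketch does not reproduce spreg's internals.

```python
# Illustrative sketch only: plain Python, no PySAL.
# All names and numbers below are hypothetical.

def row_standardise(W):
    """Divide each row of a weights matrix by its row sum."""
    out = []
    for row in W:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def spatial_lag(W, x):
    """Return the product W x for a dense list-of-lists W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def filter_residuals(u, W, lam):
    """Spatially filtered residuals: u - lam * W u."""
    Wu = spatial_lag(W, u)
    return [u_i - lam * wu_i for u_i, wu_i in zip(u, Wu)]

# Binary contiguity for 4 cells on a line: 0-1-2-3
W_binary = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
W_row = row_standardise(W_binary)

x = [10.0, 20.0, 30.0, 40.0]
lag_binary = spatial_lag(W_binary, x)  # sum of neighbouring values
lag_row = spatial_lag(W_row, x)        # average of neighbouring values

# Build residuals u satisfying u = lam * W u + eps by fixed-point iteration
# (converges because |lam| < 1 and W_row is row-standardised).
lam = 0.5
eps = [1.0, -2.0, 0.5, 1.5]
u = list(eps)
for _ in range(200):
    u = [lam * wu + e for wu, e in zip(spatial_lag(W_row, u), eps)]

# Filtering u with the same lam and W gives back eps (up to rounding).
recovered = filter_residuals(u, W_row, lam)

print("binary lag:          ", lag_binary)
print("row-standardised lag:", lag_row)
print("filtered residuals:  ", recovered)
```

The binary lag sums over neighbours while the row-standardised lag averages them, so a model estimated with one matrix is not comparable to a model estimated with the other. The last step shows why a map of u still looks autocorrelated while a map of the filtered residuals should not.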
Additionally, if you use large data sets, keep in mind that GeoDa has better performance in creating weights matrices and estimating ML models than PySAL and R.
I've adapted your code to fix these issues and the results were as expected. Please see below.
I hope this helps!
import spreg
from libpysal import weights
import geopandas as gpd
import pandas as pd
gdf = gpd.read_file("sim_data.shp")
gdf = pd.get_dummies(gdf, columns=['X1','X2'])
y = gdf[['Y']]
x = gdf[['X1_X1_Treatment','X2_X2_Treatment']]
w = weights.Queen.from_dataframe(gdf)
w.transform = 'r' # Row-standardising the weights matrix. This is the line you were missing.
error_model = spreg.ML_Error(y.to_numpy(), x.to_numpy(), w=w, name_y=y.columns[0], name_x=list(x.columns))
print(error_model.summary)
#Plot error_model.e_filtered instead of error_model.u:
gdf.plot(column = error_model.e_filtered.reshape(1,-1)[0], legend = True)
REGRESSION
----------
SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL ERROR (METHOD = FULL)
-------------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :           Y                Number of Observations:        6875
Mean dependent var  :    137.0773                Number of Variables   :           3
S.D. dependent var  :     27.0418                Degrees of Freedom    :        6872
Pseudo R-squared    :      0.0378
Sigma-square ML     :     214.730                Log likelihood        :  -28756.276
S.E of regression   :      14.654                Akaike info criterion :   57518.553
                                                 Schwarz criterion     :   57539.060
------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT     134.1806633       1.6695505      80.3693333       0.0000000
     X1_X1_Treatment       7.2093079       1.8528979       3.8908285       0.0000999
     X2_X2_Treatment      -2.5719855       1.9584170      -1.3132982       0.1890825
              lambda       0.8706089       0.0076531     113.7592507       0.0000000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
Thank you very much @pedrovma! (and @lanselin -- I'm a big fan!). This resolves my issue and I appreciate the prompt response.
I am running a series of spatial regression models on raster data and am getting inconsistent results. I have previously worked with R's spatialreg package, which I am using as a comparison. In summary, the estimated coefficients are different and the residuals of pysal.ML_Error appear to be substantially autocorrelated. Is there another argument or control necessary to get the expected behavior from pysal?
I am working with a simulated dataset with two categorical treatment effects and a continuous outcome. The treatments are wide and uniform with mean difference of 5 each. A spatial random field is also added. The shapefile for replication is attached below. Results here are for the full matrix ML models, but similar behavior is observed for Ord eigendecomposition and LU decomposition.
First, there are substantial differences in estimated coefficient standard errors:
pysal.ML_Error output:
spatialreg::errorsarlm output:
Additionally, the residual map from pysal.spreg (left) appears to remove almost no autocorrelation from the data. Contrast this with the algorithmically identical R call (right). My apologies that they are not on the same color scale.
Replication code:
Shapefile of simulated data: sim_data.zip
pysal.model.spreg version: 1.1.0
python version: 3.7
This is the Python script I am using:
And the R equivalent: