pysal / spreg

Spatial econometric regression in Python
https://pysal.org/spreg/

Spatial dependence diagnostics return nan #47

Closed. sjgiorgi closed this issue 3 years ago.

sjgiorgi commented 4 years ago

I'm trying to run a simple spatial regression (following the tutorial here), but I'm getting nan values in the spatial dependence diagnostics, and I'm not sure why. No warnings are being printed.

Any idea why I am getting nan values?

I can successfully calculate Moran's I, and everything else in the OLS call works, so in some sense my data is set up correctly.

Code:

import pandas as pd
import pysal
from pysal.weights.Wsets import w_subset  # location of w_subset in legacy PySAL 1.x

q = pysal.rook_from_shapefile('/home/user/UScounties/UScounties.shp', idVariable='FIPS')

# read in three variables from mysql: FIPS, my_var, perc_republican_vote_2016
df = pd.read_sql(sql, db_eng)
df['FIPS'] = df['FIPS'].astype(int)
df['FIPS'] = df['FIPS'].astype(str).str.rjust(5, '0')

# get subset of FIPS codes in my data set
subset = w_subset(q, df['FIPS'].values)

# calculate Moran's I
mi = pysal.Moran(df['my_var'], subset)
print(mi.I, mi.p_sim) # 0.468650371859 0.001

# zscore the data
df['my_var_z'] = (df["my_var"] - df["my_var"].mean())/df["my_var"].std(ddof=0)
df['perc_republican_vote_2016_z'] = (df["perc_republican_vote_2016"] - df["perc_republican_vote_2016"].mean())/df["perc_republican_vote_2016"].std(ddof=0)

m1 = pysal.spreg.OLS(df[['perc_republican_vote_2016_z']].values, df[['my_var_z']].values, \
                  w=subset, spat_diag=True)
print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:        2031
Mean dependent var  :      0.0000                Number of Variables   :           2
S.D. dependent var  :      1.0002                Degrees of Freedom    :        2029
R-squared           :      0.2499
Adjusted R-squared  :      0.2495
Sum squared residual:    1523.430                F-statistic           :    676.0143
Sigma-square        :       0.751                Prob(F-statistic)     :  7.085e-129
S.E. of regression  :       0.867                Log likelihood        :   -2589.843
Sigma-square ML     :       0.750                Akaike info criterion :    5183.686
S.E of regression ML:      0.8661                Schwarz criterion     :    5194.918

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.0000000       0.0192272       0.0000000       1.0000000
               var_1      -0.4999115       0.0192272     -26.0002756       0.0000000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER            1.000

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2         178.764           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                1          14.488           0.0001
Koenker-Bassett test              1          11.304           0.0008

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Lagrange Multiplier (lag)         1             nan              nan
Robust LM (lag)                   1             nan              nan
Lagrange Multiplier (error)       1             nan              nan
Robust LM (error)                 1             nan              nan
Lagrange Multiplier (SARMA)       2             nan              nan

================================ END OF REPORT =====================================

Unfortunately I cannot share the variable my_var, so this isn't directly reproducible. I tried reproducing with some open-source county-level data (from County Health Rankings), and everything works fine:

county_health = pd.read_excel('http://www.countyhealthrankings.org/sites/default/files/2018%20County%20Health%20Rankings%20Data%20-%20v2.xls', 
             'Ranked Measure Data', skiprows=1)[['FIPS', 'Years of Potential Life Lost Rate (White)', '% Uninsured']].dropna()
county_health['FIPS'] = county_health['FIPS'].astype(int)
county_health['FIPS'] = county_health['FIPS'].astype(str).str.rjust(5, '0')

# zscore the data
county_health["uninsured_z"] = (county_health["% Uninsured"] - county_health["% Uninsured"].mean())/county_health["% Uninsured"].std(ddof=0)
county_health["ypllr_z"] = (county_health["Years of Potential Life Lost Rate (White)"] - county_health["Years of Potential Life Lost Rate (White)"].mean())/county_health["Years of Potential Life Lost Rate (White)"].std(ddof=0)

subset = w_subset(q, county_health['FIPS'].values)
m1 = pysal.spreg.OLS(county_health[['ypllr_z']].values, county_health[['uninsured_z']].values, \
                  w=subset, spat_diag=True)
print(m1.summary)
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     dep_var                Number of Observations:        1547
Mean dependent var  :      0.0000                Number of Variables   :           2
S.D. dependent var  :      1.0003                Degrees of Freedom    :        1545
R-squared           :      0.1876
Adjusted R-squared  :      0.1871
Sum squared residual:    1256.706                F-statistic           :    356.8890
Sigma-square        :       0.813                Prob(F-statistic)     :   8.836e-72
S.E. of regression  :       0.902                Log likelihood        :   -2034.346
Sigma-square ML     :       0.812                Akaike info criterion :    4072.693
S.E of regression ML:      0.9013                Schwarz criterion     :    4083.381

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       0.0000000       0.0229302       0.0000000       1.0000000
               var_1       0.4331856       0.0229302      18.8915056       0.0000000
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER            1.000

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2         455.789           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                1           1.537           0.2150
Koenker-Bassett test              1           0.734           0.3917

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Lagrange Multiplier (lag)         1         404.995           0.0000
Robust LM (lag)                   1           0.860           0.3537
Lagrange Multiplier (error)       1         447.973           0.0000
Robust LM (error)                 1          43.838           0.0000
Lagrange Multiplier (SARMA)       2         448.833           0.0000

================================ END OF REPORT =====================================
ljwolf commented 4 years ago

Hmm, good question.

  1. what are the versions of scipy & pysal?

  2. Is there any missing data in df?

df[['my_var_z', 'perc_republican_vote_2016_z']].isnull().any()

If there are missing values, we don't handle that case. But I'd expect missing values to affect the other components of the regression as well, so do check.

  3. Are there any extra warnings or errors thrown (like the typical RuntimeWarning numpy generates for divide-by-zero)?

  4. If you use the LM tests directly, does this behavior occur / are any warnings raised?

from pysal.model.spreg.diagnostics_sp import LMtests

LMtests(m1, subset) 
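The missing-data check in (2) can also be made concrete without pandas; a minimal sketch in plain Python with made-up values (not the actual data), remembering that NaN never compares equal to itself:

```python
import math

# Hypothetical column with one missing entry
my_var_z = [0.3, -1.2, float("nan"), 0.8]

# Indices of missing observations; math.isnan is the reliable test,
# since nan == nan is False under IEEE 754 arithmetic.
missing = [i for i, v in enumerate(my_var_z) if math.isnan(v)]
print(missing)  # [2]
```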
sjgiorgi commented 4 years ago
  1. PySAL==1.13.0 and scipy==0.19.1

  2. No missing data.

  3. No warnings for this method. When I run mi = pysal.Moran(df['my_var'], subset) I get

('WARNING: ', '51580', ' is an island (no neighbors)')
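That island warning may well be relevant: an observation with no neighbors has an all-zero row in the weights matrix, so its spatial lag is identically zero regardless of the data, and such degenerate rows can propagate into residual-based diagnostics. A sketch (plain Python, hypothetical FIPS neighbor lists, not the actual shapefile) of how islands show up in the kind of neighbors dict a PySAL W object stores in w.neighbors and reports in w.islands:

```python
# Hypothetical neighbor structure; in PySAL, w.neighbors is a dict like this
# and w.islands lists the ids whose neighbor lists are empty.
neighbors = {
    "51580": [],                   # an island: no neighbors at all
    "51059": ["51600", "51610"],
    "51600": ["51059"],
    "51610": ["51059"],
}

# Observations with no neighbors are "islands"
islands = [fips for fips, nbrs in neighbors.items() if not nbrs]
print(islands)  # ['51580']

# An island's row in the (row-standardized) weights matrix is all zeros,
# so its spatial lag is always 0, no matter the variable.
```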

Additionally, I get errors when I run GM_Lag, though I suspect it's unrelated:

m3 = pysal.spreg.GM_Lag(df[['perc_republican_president_2016_z']].values, df[['group_norm_z']].values, \
                  w=subset, spat_diag=True)

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-30-e10554ba14df> in <module>()
----> 1 m3 = pysal.spreg.GM_Lag(df[['perc_republican_president_2016_z']].values, df[['group_norm_z']].values,                   w=subset, spat_diag=True)

/data/anaconda2/envs/dlatk/lib/python3.5/site-packages/pysal/spreg/twosls_sp.py in __init__(self, y, x, yend, q, w, w_lags, lag_q, robust, gwk, sig2n_k, spat_diag, vm, name_y, name_x, name_yend, name_q, name_w, name_gwk, name_ds)
    476             self, y=y, x=x_constant, w=w.sparse, yend=yend2, q=q2,
    477             w_lags=w_lags, robust=robust, gwk=gwk,
--> 478             lag_q=lag_q, sig2n_k=sig2n_k)
    479         self.rho = self.betas[-1]
    480         self.predy_e, self.e_pred, warn = sp_att(w, self.y, self.predy,

/data/anaconda2/envs/dlatk/lib/python3.5/site-packages/pysal/spreg/twosls_sp.py in __init__(self, y, x, yend, q, w, w_lags, lag_q, robust, gwk, sig2n_k)
    175 
    176         TSLS.BaseTSLS.__init__(self, y=y, x=x, yend=yend, q=q,
--> 177                                robust=robust, gwk=gwk, sig2n_k=sig2n_k)
    178 
    179 

/data/anaconda2/envs/dlatk/lib/python3.5/site-packages/pysal/spreg/twosls.py in __init__(self, y, x, yend, q, h, robust, gwk, sig2n_k)
    157         self.k = z.shape[1]
    158         hth = spdot(h.T, h)
--> 159         hthi = la.inv(hth)
    160         zth = spdot(z.T, h)
    161         hty = spdot(h.T, y)

/data/anaconda2/envs/dlatk/lib/python3.5/site-packages/numpy/linalg/linalg.py in inv(a)
    511     signature = 'D->D' if isComplexType(t) else 'd->d'
    512     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 513     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    514     return wrap(ainv.astype(result_t, copy=False))
    515 

/data/anaconda2/envs/dlatk/lib/python3.5/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     88 
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     91 
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

  4. The LM tests directly also don't work:
from pysal.spreg.diagnostics_sp import LMtests # <-- slightly different import than what you suggested
a = LMtests(m1, subset) 
print(a.lme, a.lml, a.rlme, a.rlml, a.sarma)

>>> ((nan, nan), (nan, nan), (nan, nan), (nan, nan), (nan, nan))
sjsrey commented 4 years ago

Moving over to spreg.

pedrovma commented 4 years ago

Really hard to say without being able to replicate. The fact that @sjgiorgi can run Moran's I doesn't help much, since Moran's I is computed on the single variable while the LM tests operate on the regression residuals.
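For context on that distinction: Moran's I is a statistic of a single series z against W, so it can be perfectly well-behaved even when the residual-based LM tests break. A toy computation of the standard formula I = (n / S0) * (sum_ij w_ij z_i z_j) / (sum_i z_i^2), in plain Python with made-up values and binary weights (not the county data):

```python
# Made-up deviations-from-mean for 4 observations on a line: 1-2-3-4
z = [1.0, 0.5, -0.5, -1.0]
# Binary rook-style weights: neighbors are adjacent observations
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

n = len(z)
s0 = sum(sum(row) for row in W)   # total weight = 6
num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
den = sum(v * v for v in z)       # = 2.5
I = (n / s0) * (num / den)
print(I)  # positive: similar values cluster among neighbors
```

None of this touches the residuals, which is why a clean Moran's I on my_var says little about whether the LM diagnostics can be computed.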

The GM_Lag error may be more informative. Usually, this LinAlgError: Singular matrix error in GM_Lag occurs when there is a variable that mimics the weights matrix, so X and WX are perfectly collinear.
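To make that concrete, a toy sketch (pure Python, made-up 3x3 example, not spreg's actual code path): with a row-standardized W, a constant column x has spatial lag Wx equal to x itself, so the instrument columns [x, Wx] are perfectly collinear and the cross-product H'H that 2SLS inverts has determinant zero.

```python
# Row-standardized weights for 3 observations (each neighbor weighted equally)
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]

x = [1.0, 1.0, 1.0]  # a column that "mimics" the weights structure

# Spatial lag Wx: for row-standardized W and constant x, Wx == x
Wx = [sum(W[i][j] * x[j] for j in range(3)) for i in range(3)]
print(Wx)  # [1.0, 1.0, 1.0]

# Instrument matrix H = [x, Wx]; its Gram matrix H'H is then singular
hth = [[sum(a[i] * b[i] for i in range(3)) for b in (x, Wx)] for a in (x, Wx)]
det = hth[0][0] * hth[1][1] - hth[0][1] * hth[1][0]
print(det)  # 0.0 -> la.inv(hth) raises LinAlgError: Singular matrix
```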

@sjgiorgi could you please try to run OLS and GM_Lag using your y variable on a different X and also a different y on your X?

pedrovma commented 3 years ago

Closed due to no response.