pysal / esda

statistics and classes for exploratory spatial data analysis
https://pysal.org/esda
BSD 3-Clause "New" or "Revised" License

NaN attribute values cause weird behaviour with Local Moran's I and Local Geary #215

Open waeiski opened 2 years ago

waeiski commented 2 years ago

I am not sure whether what I've encountered is a bug or expected behaviour with data like mine.

I am trying to perform a Local Moran's I analysis on a population variable in a vector file that consists of 250-meter grid cells. The output gives every cell a pseudo p-value of 0.0001 and a classification of 3 (or 'LL'), while the local Moran's I value for every grid cell is NaN. The grid vector file does not contain any islands, but a majority of the cells have NaN values because not all cells are populated. The spatial weights are constructed without errors, and this behaviour happens regardless of the type of spatial weights (Queen, Rook and KNN). The same happens if I switch to local Getis-Ord: the G values are all NaN. I think esda's handling of NaN/None values might need to be checked. I could not find any mention of how to deal with vector files that have many NaNs in the attribute fields in the PySAL tutorials or the libpysal/esda documentation, so I am not 100% sure whether this is a bug or expected behaviour with this type of data.

For what it is worth, GeoDa does not produce the same weird result with the exact same data. GeoDa classifies cells with NaN values as "Undefined", so the problem shouldn't be with my data.

Last but not least, I am running PySAL 2.3.0 with esda 2.3.1. I am on these older versions because I work in a remote "secure computing environment", where updating anything costs money. However, I am not sure whether this behaviour is the same one that was solved in https://github.com/pysal/esda/pull/196. If it has been solved in that PR or elsewhere, I guess I will have to dish out some project funding.

ljwolf commented 2 years ago

Hey, thanks for the report!

In the past, we had agreed as a library to not handle missing values. The solution you link to (#196) would not "solve" the problem you're raising, since it would simply error by saying you have NaN/None values. So, I don't think upgrading will do what you want.

The "best" solution would probably be for us to set all "missing value" rows of the weights matrix to zero, so that they never enter into any of the computations. I'm not sure at the moment whether that would have ramifications down the line, but you can try that.
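For illustration, a rough sketch of how one might build such a zeroed-out weights object (this is not something esda/libpysal does for you; it assumes `y` is aligned with `W.id_order`, and `missing_ids`, `zeroed` and `w_zeroed` are just illustrative names):

# rough sketch only: zero out any weight that involves a missing observation
import numpy
from pysal.lib import weights

nan_values = numpy.isnan(y)
missing_ids = set(numpy.asarray(W.id_order)[nan_values])
zeroed = {
    i: [0.0 if (i in missing_ids or j in missing_ids) else wij
        for j, wij in zip(W.neighbors[i], W.weights[i])]
    for i in W.id_order
}
w_zeroed = weights.W(W.neighbors, zeroed, id_order=list(W.id_order))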

A different easy workaround for you would be to replace any missing values with their lagged value, and then mask them out again afterwards. So,

from pysal.lib import weights
import numpy
import esda

# y: the attribute values as a numpy array (containing NaNs); W: your spatial weights
nan_values = numpy.isnan(y)

# fill the missing cells with their spatial lag (note that a missing cell whose
# neighbours are also missing will keep a NaN smoothed value)
y_smoothed = numpy.copy(y)
y_smoothed[nan_values] = weights.lag_spatial(W, y)[nan_values]

lmo = esda.Moran_Local(y_smoothed, W)

# mask the originally-missing cells back out of the results
Is = numpy.copy(lmo.Is)
Is[nan_values] = numpy.nan
p_sim = numpy.copy(lmo.p_sim)
p_sim[nan_values] = numpy.nan
q = lmo.q.astype(str)
q[nan_values] = "Undefined"

darribas commented 2 years ago

To add to Levi's suggestion, a couple of thoughts:

  1. You say some cells are "unpopulated". Does that mean the value is actually 0? If so, you could fill those in and run the LISA on the fully dense dataset.
  2. If those are not zero, it might be more efficient to remove them from the data before computation. Two ways:

a. Remove them before building your weights

# Using Levi's notation and objects where possible and assuming the geographies and y are part of 
# a GeoDataFrame called `db`
w_sub = weights.Queen.from_dataframe(db.loc[~nan_values, :])

lmo = esda.Moran_Local(y[~nan_values], w_sub)

b. Subset w after creation

w_sub = weights.w_subset(W, y[~nan_values].index)

lmo = esda.Moran_Local(y[~nan_values], w_sub)

The above assumes you built W from a table indexed like y (e.g., that y is a column in db).

These two approaches are equivalent to computing the LISA on a geography with "holes", which is a slightly different assumption than the one you would make by smoothing as Levi suggests. Both are second bests, but in slightly different ways. Which one is preferable, I'd say, depends on your use case.

waeiski commented 2 years ago

Thanks @darribas @ljwolf for the suggestions and clarifications. I will try setting to zero, masking, and removing, and see how the results differ. If it helps PySAL's development or documentation work, I can report the differences back here.

As for Dani's questions: you are correct. The cells with NaNs are unpopulated, so 0 people live in these cells (e.g. industrial areas, forests, lakes, sea, etc.). I'll see whether filling the NaNs with 0s causes population "cold spots" to occur in these unpopulated areas, which in a sense would be correct, but it is not what I am after. I would like the cold spots to indicate cells that have few inhabitants instead of none at all.
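Concretely, the zero-fill version I have in mind is roughly this (reusing Levi's `y` and `W` from above; `y_zero` and `lmo_zero` are just my placeholder names):

import numpy
import esda

# treat unpopulated (NaN) cells as 0 inhabitants and run the LISA on the
# fully dense dataset, as in Dani's first suggestion
y_zero = numpy.copy(y)
y_zero[numpy.isnan(y)] = 0.0
lmo_zero = esda.Moran_Local(y_zero, W)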

darribas commented 2 years ago

> I would like the cold spots to indicate cells that have few inhabitants instead of none at all.

I'd then probably go for subsetting to only those cells you know have population, as that is your "universe". Filling them with zeros or smoothing as Levi suggests would implicitly extend the analysis to all areas, including those with no population. If that is a large group of cells, I imagine all you'd get is a map of empty areas versus areas with someone in them.