pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal

`WSP` builder from rasters returns integers only #446

Open darribas opened 2 years ago

darribas commented 2 years ago

The recommended WSP builder in the raster section of weights seems to return WSP objects that can only hold int8 values. This makes it fail, for example, on computations that require products, like LISA statistics.

From the raster demo:

from libpysal.weights import Queen, raster
import numpy as np
import xarray as xr

from esda import Moran_Local

ds = xr.tutorial.open_dataset("air_temperature.nc")  # -> returns a xarray.Dataset object
da = ds["air"]  # we'll use the "air" data variable for further analysis

da = da.groupby('time.month').mean()

coords_labels = {}
coords_labels["z_label"] = "month"  # since month does not belong to the default list we need to pass it using a dictionary
w_queen = Queen.from_xarray(
    da, z_value=12, coords_labels=coords_labels, sparse=False)

w_queen is a full W and seems to work as expected:

w_queen.transform = 'r'
np.unique(w_queen.sparse.data)
> array([0.125     , 0.2       , 0.33333333])

If we build it straight into a WSP, which is much faster and more memory efficient (and the current default):

w_queen = Queen.from_xarray(
    da, z_value=12, coords_labels=coords_labels, sparse=True)
w_queen.transform = 'r'
np.unique(w_queen.sparse.data)
> array([1], dtype=int8)

Any ideas why it does not switch to floats to allow the weights to adjust?
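For reference, a minimal sketch of the mechanism I suspect (plain numpy/scipy, outside libpysal, so nothing below is the actual WSP code): writing standardised values back into an int8 buffer truncates them to zero, whereas an out-of-place operation would promote to float:

import numpy as np
from scipy import sparse

# binary contiguity-style weights stored as int8
S = sparse.csr_matrix(np.array([[0, 1, 1],
                                [1, 0, 1],
                                [1, 1, 0]], dtype=np.int8))

# number of neighbours per row, repeated once per stored entry
row_sums = np.asarray(S.sum(axis=1)).flatten()
per_entry = np.repeat(row_sums, np.diff(S.indptr))

# writing the standardised values back in place truncates them to 0
S.data[:] = S.data / per_entry
np.unique(S.data)
> array([0], dtype=int8)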

cc @MgeeeeK, who might have some suggestions since they worked on this intensively.

MgeeeeK commented 2 years ago

Probably a bug, lemme check!

ljwolf commented 2 years ago

Hmm... da2WSP() doesn't deal with standardization, so that option is ignored when sparse=True, since the keyword arguments are simply propagated down to the da2W() and da2WSP() functions via splatting (**kwargs).

Should be a one-line fix, either in da2WSP() or directly in the .from_xarray() method... probably best in da2WSP(). In theory, we should have a centralized "standardizer" mixin class 😄
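To make that concrete, here is a sketch of the shape such a fix could take (this is not the actual da2WSP() code, and _row_standardize is a made-up name): cast to float before standardising, so the division can't truncate:

import numpy as np
from scipy import sparse

def _row_standardize(sw):
    # sketch of a standardizer for a WSP-style sparse matrix;
    # casting to float64 up front is the "one-line fix"
    sw = sparse.csr_matrix(sw, dtype=np.float64)
    row_sums = np.asarray(sw.sum(axis=1)).flatten()
    row_sums[row_sums == 0] = 1  # leave rows with no neighbours alone
    sw.data /= np.repeat(row_sums, np.diff(sw.indptr))
    return sw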

ljwolf commented 2 years ago

> Probably a bug, lemme check!

whoops, should've left this for you @MgeeeeK, sorry :/ just saw in morning email & recalled the standardization options...

darribas commented 2 years ago

int8 is hardcoded here:

https://github.com/pysal/libpysal/blob/0f03a313896de6af096ff92105e54da86bb78368/libpysal/weights/raster.py#L635

I think that makes sense as a default, as it's more memory efficient, but I thought it'd convert directly into other types (e.g. float64) if multiplied. Perhaps that is not occurring?

I think the ideal behaviour is that, by default, the binary weights are stored as int8, but if we transform the weights (e.g. row-standardise, or even take a spatial lag) the dtype is promoted so the operations go through.
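For what it's worth, out-of-place products do promote; it's the in-place operations on the int8 buffer that refuse to cast. A quick check with plain scipy:

import numpy as np
from scipy import sparse

S = sparse.eye(3, dtype=np.int8, format="csr")
(S * 0.5).dtype
> dtype('float64')

try:
    S.data *= 0.5  # in-place multiply on the int8 buffer
except TypeError:
    print("numpy refuses to cast float64 results back into int8")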

MgeeeeK commented 2 years ago

> whoops, should've left this for you @MgeeeeK, sorry :/ just saw in morning email & recalled the standardization options...

https://github.com/pysal/libpysal/blob/0f03a313896de6af096ff92105e54da86bb78368/libpysal/weights/raster.py#L301

no problem! I think you are correct: sparse weights do not support transformations right now (afaiu).

> I think the ideal behaviour is that, by default, the binary weights are stored as int8, but if we transform the weights (e.g. row-standardise, or even take a spatial lag) the dtype is promoted so the operations go through.

that makes sense, will look into it

ljwolf commented 2 years ago

A similar issue affects lat2SW https://github.com/pysal/libpysal/blob/master/libpysal/weights/util.py#L1248
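A quick check, assuming the linked line is what ships in the released version:

from libpysal.weights.util import lat2SW

sw = lat2SW(3, 3)  # rook weights for a 3x3 lattice, as a scipy sparse matrix
sw.dtype           # expecting int8 here as well, if that line is in effect
> dtype('int8')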

darribas commented 2 years ago

Would it make sense to let the user choose the dtype, with a default of int8 and/or an 'auto' option that picks up the dtype of the DataArray? This might not be very efficient at scale but, as it turns out, it does make computations down the line easier (e.g. numba in the LISA complains if the W is expressed as float32 and the y as float64). Should we maybe even default to float64, so that if you know what you're doing you can set it to int8 to run lean?
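As a straw man (the wrapper name is made up, and I'm assuming .sparse is a plain assignable attribute on WSP), the choice could look something like:

import numpy as np
from libpysal.weights import raster

def da2WSP_with_dtype(da, dtype="auto", **kwargs):
    # hypothetical wrapper, not the current API: pick the storage dtype up front
    wsp = raster.da2WSP(da, **kwargs)
    target = da.dtype if dtype == "auto" else np.dtype(dtype)
    wsp.sparse = wsp.sparse.astype(target)  # assumes .sparse is assignable
    return wsp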

darribas commented 2 years ago

I've given it a bit more thought and fleshed out some ideas in this notebook. There's a quick sketch of how we could go about aligning the dtypes of y and w when running things like Moran_Local. It relies on a function, aligner, that takes the following arguments:

'''
    y
    w
    how : str/dtype
         [Optional. Default='y'] Alignment policy:
         - 'y': use `y.dtype`
         - 'w': use the `dtype` of `w`
         - `dtype`: cast both to this explicit target `dtype`
'''
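A minimal sketch of how I'd read that signature (this is an illustration, not a copy of the notebook; _convert_w_dtype assumes a WSP-style object whose .sparse is a plain attribute):

import numpy as np

def _convert_w_dtype(w, dtype):
    # note this allocates a full new sparse matrix; see point 3 below
    w.sparse = w.sparse.astype(dtype)
    return w

def aligner(y, w, how="y"):
    # return y and w sharing a single dtype, following the policy above
    if how == "y":
        target = y.dtype
    elif how == "w":
        target = w.sparse.dtype
    else:
        target = np.dtype(how)
    return y.astype(target), _convert_w_dtype(w, target)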

I'll copy here my current thinking (from the bottom of the notebook):

  1. Something like aligner could be embedded in Moran_Local and all methods that rely on numba, but it would need to be applied after any transformations and on each of the classes that currently rely on accelerated randomisation
  2. My general sense is to pick a sensible default (either y.dtype or float64) and leave flexibility as an option for folks who need it
  3. Something to keep in mind, which I've not thought much about beyond noticing it might be an issue, is whether there are more memory-efficient ways of doing all this. If w is a large matrix, the approach taken in _convert_w_dtype might not be feasible/desirable. Do we have other ways? (one possibility is sketched after this list)
  4. At any rate, if aligner were called within something like Moran_Local, warnings would need to be raised so the user is aware. In fact, an option in Moran_Local could select what to do: raise an error if the dtypes differ, or convert them automatically.
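On point 3, one memory-conscious option (a sketch with plain scipy; the helper name is made up) is to cast only the data buffer of the CSR matrix and share the index arrays instead of copying the whole thing:

import numpy as np
from scipy import sparse

def cast_data_only(sw, dtype=np.float64):
    # copy=False shares indices/indptr with the original matrix;
    # only the data buffer is re-allocated by astype
    return sparse.csr_matrix(
        (sw.data.astype(dtype), sw.indices, sw.indptr),
        shape=sw.shape,
        copy=False,
    )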

I think there are two aspects to this issue: one immediate, which we could fix along the lines suggested, and a longer-term one about how W objects are stored and manipulated, perhaps to be addressed in a new iteration of the current W/WSP objects.