makepath / xarray-spatial

Raster-based Spatial Analytics for Python
https://xarray-spatial.readthedocs.io/
MIT License
805 stars 81 forks source link

Xrspatial Proximity doesn't work with large rasters even with dask backed array #773

Open GrahamReveley opened 1 year ago

GrahamReveley commented 1 year ago

Xrspatial not working with large dask backed arrays I believe there's a bug in the xrspatial.proximity method whereby the coordinates for a large array (even if its a dask array) are loaded into memory as a numpy array instead of there being a dask array. For context I'm attempting to compute a distance to coast for a 90m global dataset. I had a look into the source code and I believe the issue is here (in the _process function of proximity):

` raster_dims = raster.dims if raster_dims != (y, x): raise ValueError( "raster.coords should be named as coordinates:" "({0}, {1})".format(y, x) )

distance_metric = DISTANCE_METRICS.get(distance_metric, None)
if distance_metric is None:
    distance_metric = DISTANCE_METRICS["EUCLIDEAN"]

target_values = np.asarray(target_values)

# x-y coordinates of each pixel.
# flatten the coords of input raster and reshape to 2d
xs = np.tile(raster[x].data, raster.shape[0]).reshape(raster.shape)
ys = np.repeat(raster[y].data, raster.shape[1]).reshape(raster.shape)

`

Therefore XS and YS are huge numpy arrays that don't fit into memory, whereas, if the input data is a dask array these should probably be dask arrays rather than numpy arrays. Later on in the processing sequence the proximity calculation is either done using dask or numpy and I think there should be a similar thing here for dask/numpy processing.

If I've missed something please let me know, more than happy to share some more code if needs be and in the mean time I can use gdal_proximity.py directly but I guess it would be slower than using the dask backed xrspatial.

Thanks!

bstadlbauer commented 1 year ago

Also ran into this 👍

brendancol commented 1 year ago

@GrahamReveley @bstadlbauer thanks contributing here and I believe this is a known issue to @thuydotm, but hasn't been formally documented. also sorry for delay in responding.

We are going to prioritize this proximity over the next sprints, but happy to help review / test a PR if y'all want contribute a fix.

bstadlbauer commented 1 year ago

@brendancol We do have a constrained usecase and are working with a changed fork of the proximity code here, but IIRC that's nothing that we can easily generalize to work for all usecases