pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal

Distributed weights (builder) #547

Open darribas opened 1 year ago

darribas commented 1 year ago

As the new generation of weights is coming to fruition (#534 ), I wanted to open an issue to collate a proposal and some ideas @martinfleis and I have been fleshing out over the past couple of years. If nothing else, this can at least be a space to discuss whether it'd make sense to have this on the roadmap at all and, if so, how it can be squared with all the ongoing parallel plans on weights.

Idea: support (at least) part of the weights functionality on parallelisable and distributed data structures

Why: building weights for larger-than-memory datasets would lay the foundation for spatial analytics (i.e., PySAL functionality) at planetary scale. My own take is that over the last ten years we've gotten away without this because RAM and CPUs have grown faster than the data we were using. I think this is changing: we can now access datasets that are significantly bigger (even just at national scale) and on which we'd like to run pysal.

How: our proposal (and this is very much up for debate too!) is to build functionality that represents weights as adjacency tables stored as dask.dataframe objects, which don't need to live in memory in full. To build them, we could rely on dask_geopandas.GeoDataFrame objects. This approach has a couple of advantages: the full adjacency never needs to fit in memory, and construction can be parallelised across chunks.
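A minimal sketch of what this could look like, building within-chunk adjacency tables lazily. The input file, the choice of Queen contiguity, and the partition count are illustrative assumptions, not a settled design:

```python
# Sketch only: build within-chunk adjacency tables as a lazy dask.dataframe.
# "polygons.gpkg" is a hypothetical input; cross-chunk links are NOT handled here.
import dask_geopandas
import geopandas
from libpysal import weights

def partition_adjacency(chunk):
    # adjacency table for one in-memory chunk: focal, neighbor, weight columns.
    # Note: ids here are chunk-local; a global index would be needed in practice.
    w = weights.Queen.from_dataframe(chunk)
    return w.to_adjlist()

gdf = geopandas.read_file("polygons.gpkg")
ddf = dask_geopandas.from_geopandas(gdf, npartitions=8)

# a lazy dask.dataframe of per-chunk adjacencies; in practice map_partitions
# may need an explicit `meta` so dask can infer the output schema
adj = ddf.map_partitions(partition_adjacency)
```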

If the above seems reasonable, probably the main technical blocker is defining spatial relationships across different chunks (for geometries within the same chunk it's straightforward), and then merging results at the end. I know Martin had some ideas here, and there is precedent we built out of necessity, and in very ad hoc ways, for the Urban Grammar here (note this all predates the 0.1 release of dask-geopandas, so we didn't have a dask_geopandas.GeoDataFrame to build upon; some of it might be redundant, or it might be possible to streamline it considerably).
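One way to narrow the cross-chunk search, using pieces dask-geopandas already has: compute a spatial extent per partition, so only partition pairs whose extents intersect need a cross-chunk neighbour check. A hedged sketch, continuing the `ddf` from the sketch above:

```python
# Sketch: use per-partition extents to find the only chunk pairs that can
# contain cross-chunk neighbours. `ddf` is the dask_geopandas.GeoDataFrame
# from the previous sketch.
ddf.calculate_spatial_partitions()  # convex hull of each partition
extents = ddf.spatial_partitions

candidate_pairs = [
    (i, j)
    for i in range(len(extents))
    for j in range(i + 1, len(extents))
    if extents.iloc[i].intersects(extents.iloc[j])
]
# each (i, j) pair would then get a pairwise neighbour search, and the
# results merged with the within-chunk adjacencies at the end
```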

ljwolf commented 1 year ago

The new stuff focuses on the adjacency dataframe as its core data structure & keeps methods quite lean to make exactly this possible... or at least we hope!

We'd still need a testbench to profile whether we need to .collect()/.compute() things (like .lag()), or when a method might try to persist data. I think we should profile this by building a standard dask dataframe of an adjacency table and sending it directly to Graph(). I'll add it to the roadmap.
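A rough harness along those lines, timing the eager baseline first; whether Graph.from_adjacency() accepts a still-lazy dask dataframe at all is exactly the open question, and "adjacency/" is a hypothetical parquet dataset:

```python
# Hedged profiling sketch: time the eager path, then try passing the lazy
# dask dataframe to see what Graph forces to compute.
# "adjacency/" is a hypothetical dataset with focal/neighbor/weight columns.
import time

import dask.dataframe as dd
import numpy
from libpysal.graph import Graph

adj = dd.read_parquet("adjacency/")  # lazy adjacency table

t0 = time.perf_counter()
g = Graph.from_adjacency(adj.compute())  # eager baseline: materialise first
t1 = time.perf_counter()

y = numpy.random.random(g.n)
t2 = time.perf_counter()
wy = g.lag(y)  # does .lag() persist or recompute anything?
t3 = time.perf_counter()

print(f"build: {t1 - t0:.2f}s  lag: {t3 - t2:.2f}s")
# next step: pass `adj` (still lazy) straight to Graph.from_adjacency and
# see what breaks / what gets computed
```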

For a distributed constructor, I think the existing rook()/queen() will already work on a distributed GeoDataFrame() via its sindex, but vertex_set_intersection() won't. I'm not sure about the KNN/range queries behind KNN/DistanceBand, or about out-of-core Delaunay() and subgraphs thereof...

martinfleis commented 1 year ago

> I think the existing rook()/queen() will already work on a distributed GeoDataFrame() via its sindex

Nope, dask_geopandas.GeoDataFrame does not have the sindex functionality that would work in a distributed way.

jGaboardi commented 1 year ago

> I think the existing rook()/queen() will already work on a distributed GeoDataFrame() via its sindex

> Nope, dask_geopandas.GeoDataFrame does not have the sindex functionality that would work in a distributed way.

Do you know if it is planned? (or even feasible?)

martinfleis commented 1 year ago

> Do you know if it is planned? (or even feasible?)

Planned would be a bit too strong a word. We have discussed it and concluded that it may be feasible, but that it may also be better to guide people to a two-step approach: first an sindex query over the partitions, then an sindex query over the geometries within the matching subset of partitions. That is what the current implementation of the distributed sjoin does, and I suppose what @darribas was using in his work on disk-based areal interpolation. I can't promise anything on this front; dask-geopandas is not a priority for any of us at the moment.
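For concreteness, a hedged sketch of that two-step pattern using existing geopandas/dask-geopandas pieces; the function name and input file are illustrative, not an existing API:

```python
# Illustrative two-step query: partition extents first, then geometries
# within the matched partitions. Not an existing dask-geopandas API.
import dask_geopandas
import geopandas

gdf = geopandas.read_file("polygons.gpkg")  # hypothetical input
ddf = dask_geopandas.from_geopandas(gdf, npartitions=8)
ddf.calculate_spatial_partitions()

def candidates_for(geom):
    # step 1: sindex query over partition extents
    part_ids = ddf.spatial_partitions.sindex.query(geom, predicate="intersects")
    # step 2: sindex query over geometries, one matched partition at a time
    hits = []
    for p in part_ids:
        chunk = ddf.get_partition(int(p)).compute()  # one chunk in memory
        idx = chunk.sindex.query(geom, predicate="intersects")
        hits.append(chunk.iloc[idx])
    return hits
```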

darribas commented 1 year ago

Maybe we can see what's required for the constructors and retrofit it, either with a makeshift sindex or by feeding it upstream. I'm not sure how it'd interact with the overlapping computation across chunks…