Idea collection/discussion for Raster awareness project

MgeeeeK commented 4 years ago

This issue sheds some light on the discussions I'd during raster awareness.. project calls. Some of the main points discussed are listed below:

xarray will be used in place of rasterio to access raster files.
transformational methods for converting xarray.dataArray object to PySAL's W and WSP object during initial week.
performance improvements, memory optimization for the created methods.
Housing/location of the methods are still under discussion (one of the suggestion was to keep them inside util.py)

Discussion also involved brainstorming for features and functionality to be incorporated after phase 1:

functionality to support timeline analysis
support for multi band handling
integration with splot
using Dask for handling large rasters

This issue will also be used for future discussion related raster awareness project and is open for discussion!, any suggestions or feedbacks are most welcome.

darribas commented 4 years ago

Great start!

A couple more bits from the discussions we've had:

The idea is to have it as easy as possible to do a full journey to and from xarray.DataArray objects to libpysal.W and aligned pandas.Series objects. In this context, it might be good to think of which ways could the API be designed to be as user-friendly and as integrated with other projects in the eco-system as possible
Once the API is sketched, and initial implementations are in place, we can start thinking of efficient and fast ways to write the code under the hood (e.g. for building weights from rasters, mayb numba can be of help

darribas commented 4 years ago

A quick post to elaborate on the optimization phase, once we have the API ironed out. My suspicion is the main area to improve will be building weights efficiently. Efficiently in this context means scalable (using as little memory as possible) and fast. Here're two steps to take:

[ ] For scalability, I think the first step is to move from relying on lat2W/lat2SW into a method that only includes elements for which the DataArray has (non-missing) data. This would require a new method, to go in raster.py that we could codename raster2W/raster2SW
[ ] To make things fast, I think the ideal solution is to get a numba-based version of lat2W/lat2SW (which would improve their applications in other areas beyond raster where the methods are used); and raster2W/raster2SW, which would be the ideal solution for the raster project.

@MgeeeeK / @slumnitz does this sound sensible/feasible? If so, maybe we should move these parts into separate issues to track progress on them independently?

MgeeeeK commented 4 years ago

This sounds promising @darribas ... I have to figure out the logic for both methods raster2W/SW, after finalizing the interface's API this week and adding tests I'll look into the possible logic of creating weights object with only non-missing data from start which will be more efficient than current method. Side-by-side I can start working on incorporating numba to speed up existing methods.

darribas commented 4 years ago

More ideas for the optimisation phase:

[ ] Contiguity-based weights will probably provide a very small set of neighbors for local statistics. To provide a solution for this, libpysal currently has higher_order, which will power a matrix so the neighbors of a neighbor are also a neighbor. It's unclear how this scale to raster-type sizes, so it might be worth at least checking a) how fast this is and b) whether there's room for optimisation
[ ] Explore support for KNN weights, which appears to be supported. Does it scale? Is it fast? Can we make it (even more) scalable/faster? Peeking into the implementation, it converts cell centroids into points and they're shipped directly into our KNN builder. This can probably be optimised for rasters as the raster case is a simpler version where you probably don't need a KDTree as everyone is equally spaced.

martinfleis commented 4 years ago

@darribas just to clarify as I was dealing with it today in #313. higher_order does not make neigbors of neighbor also a neighbor, but makes them only neighbors. It ignores lower-order contiguities at the moment.

sjsrey commented 4 years ago

To add some background, the original implementation of higher_order follows from the logic in Anselin and Smirnov (1996).

ljwolf commented 4 years ago

Given our call the other day, I wanted to provide some background for the statements I was making about network representations. I knew they were stuck way deep down somewhere from my time early in my PhD when I was still pursuing the dual degree between Industrial Engineering & Geography. Ahuja, Magnati, & Ohrlin. Network Flows: Theory, Algorithms, & Applications, is a wonderful book on graph theory from an industrial engineering perspective. Chapter 2 has this discussion on network representations. Lots of it is quite outdated (e.g. performance of implementing these in C++/FORTRAN will be different for us) but conceptually, these are useful:

Screenshot from 2020-07-15 10-42-39

In their language:

Our to_adjlist() constructs AMO's forward star adjacency list, a single array containing source, sink, *metadata.
Our sparse matrix is AMO's node-node adjacency matrix in a form that is efficient for sparse networks.

Our neighbors, weights dict strategy approaches the singly-linked adjacency list from AMO, with a few key differences:

our cells have a single value: the index of the neighbor. AMO suggests representing this (basically) as a C struct that contains all information about the edge.
our indices have mixed semantics. For example, cannot assume neighbors[i][j] is reversible to neighbors[j][i], since nodes might have string-based names (so i can definitely be a string) but their ordinal position in neighbors[i] is integer-valued (so, j must be an int). That is, for neighbors[i][j], i is a node's "name" but j is the index of node j in i's neighbor list. To approximate this reversibility property in AMO's singly-linked list, we can just use j_name = neighbors[i_name][j_idx], if j_idx is known in advance. But, for real reversibility, we would need dict-of-dicts or some other index of how to map from j_index_in_i_neighbors to j_name.
weights are separated from indices: if we delete neighbors[i_name][j_ix], the corresponding weight also has to be deleted in weights[i_name][j_idx]. in AMO, you delete the cell of the linked list & all information about that link is removed.

MgeeeeK commented 4 years ago

Thanks @ljwolf, this is really handy for me to get clear picture of W object.

darribas commented 4 years ago

@MgeeeeK I'd open a separate issue just to fully flesh out the new structure of WSP objects and track its implementation for GSoC'20

pysal / libpysal

Idea collection/discussion for Raster awareness project #293