Open MgeeeeK opened 4 years ago
Great start!
A couple more bits from the discussions we've had:
xarray.DataArray
objects to libpysal.W
and aligned pandas.Series
objects. In this context, it might be good to think of which ways could the API be designed to be as user-friendly and as integrated with other projects in the eco-system as possiblenumba
can be of helpA quick post to elaborate on the optimization phase, once we have the API ironed out. My suspicion is the main area to improve will be building weights efficiently. Efficiently in this context means scalable (using as little memory as possible) and fast. Here're two steps to take:
lat2W
/lat2SW
into a method that only includes elements for which the DataArray
has (non-missing) data. This would require a new method, to go in raster.py
that we could codename raster2W
/raster2SW
lat2W
/lat2SW
(which would improve their applications in other areas beyond raster where the methods are used); and raster2W
/raster2SW
, which would be the ideal solution for the raster project.@MgeeeeK / @slumnitz does this sound sensible/feasible? If so, maybe we should move these parts into separate issues to track progress on them independently?
This sounds promising @darribas ...
I have to figure out the logic for both methods raster2W/SW
, after finalizing the interface's API this week and adding tests I'll look into the possible logic of creating weights object with only non-missing data from start which will be more efficient than current method.
Side-by-side I can start working on incorporating numba to speed up existing methods.
More ideas for the optimisation phase:
libpysal
currently has higher_order
, which will power a matrix so the neighbors of a neighbor are also a neighbor. It's unclear how this scale to raster-type sizes, so it might be worth at least checking a) how fast this is and b) whether there's room for optimisationKNN
weights, which appears to be supported. Does it scale? Is it fast? Can we make it (even more) scalable/faster? Peeking into the implementation, it converts cell centroids into points and they're shipped directly into our KNN builder. This can probably be optimised for rasters as the raster case is a simpler version where you probably don't need a KDTree as everyone is equally spaced.@darribas just to clarify as I was dealing with it today in #313. higher_order
does not make neigbors of neighbor also a neighbor, but makes them only neighbors. It ignores lower-order contiguities at the moment.
To add some background, the original implementation of higher_order follows from the logic in Anselin and Smirnov (1996).
Given our call the other day, I wanted to provide some background for the statements I was making about network representations. I knew they were stuck way deep down somewhere from my time early in my PhD when I was still pursuing the dual degree between Industrial Engineering & Geography. Ahuja, Magnati, & Ohrlin. Network Flows: Theory, Algorithms, & Applications, is a wonderful book on graph theory from an industrial engineering perspective. Chapter 2 has this discussion on network representations. Lots of it is quite outdated (e.g. performance of implementing these in C++/FORTRAN will be different for us) but conceptually, these are useful:
In their language:
to_adjlist()
constructs AMO's forward star adjacency list, a single array containing source, sink, *metadata
.sparse
matrix is AMO's node-node adjacency matrix in a form that is efficient for sparse networks. Our neighbors, weights
dict strategy approaches the singly-linked adjacency list from AMO, with a few key differences:
neighbors[i][j]
is reversible to neighbors[j][i]
, since nodes might have string-based names (so i
can definitely be a string) but their ordinal position in neighbors[i]
is integer-valued (so, j
must be an int). That is, for neighbors[i][j]
, i
is a node's "name" but j
is the index of node j
in i
's neighbor list. To approximate this reversibility property in AMO's singly-linked list, we can just use j_name = neighbors[i_name][j_idx]
, if j_idx
is known in advance. But, for real reversibility, we would need dict-of-dicts or some other index of how to map from j_index_in_i_neighbors
to j_name
. neighbors[i_name][j_ix]
, the corresponding weight also has to be deleted in weights[i_name][j_idx]
. in AMO, you delete the cell of the linked list & all information about that link is removed.Thanks @ljwolf, this is really handy for me to get clear picture of W
object.
@MgeeeeK I'd open a separate issue just to fully flesh out the new structure of WSP
objects and track its implementation for GSoC'20
This issue sheds some light on the discussions I'd during
raster awareness..
project calls. Some of the main points discussed are listed below:xarray
will be used in place ofrasterio
to access raster files.xarray.dataArray
object to PySAL'sW
andWSP
object during initial week.util.py
)Discussion also involved brainstorming for features and functionality to be incorporated after phase 1:
splot
Dask
for handling large rastersThis issue will also be used for future discussion related
raster awareness
project and is open for discussion!, any suggestions or feedbacks are most welcome.