Open jGaboardi opened 3 years ago
I am all in favour of this. I would even say we should deprecate libpysal.cg
across the ecosystem.
Do you have an idea of a timeline for this? It may make sense to schedule a call and discuss overlaps with momepy. Since we'll be likely refactoring both in some way, it makes sense to do it once and in a super compatible way.
I started updating the Refactor spaghetti
document again. Take a look and let me know how you're thinking about it.
Maybe we can also loop in @tomalrussell and @HTenkanen? Anyone else you can think of?
Hi @jGaboardi - I'd be happy to be looped in to discussions, though I'm not familiar with spaghetti
internals.
The (gdf_nodes, gdf_edges)
tuple or small wrapper object seems like it's the de-facto standard for spatial networks with explicit geometries (and optional attributes) for nodes and edges. You suggest it in your note, snkit.network.Network
is a minimal wrapper, and pyrosm.OSM.get_network
returns a tuple of gdfs.
I'm not sure if this is in scope for this issue, but there are various adaptors to graph/network representations:
momepy
handles gdf<>nx
conversion, including making a graph directly from LineStringsosmnx
works with networkx.MultiDiGraph
and has helpers to/from gdfs osmnx.utils_graph.graph_from_gdfs
pandana
has pandana.network.Network
without explicit edge geometries or other node/edge attributes, but accepts node_x
, node_y
, edge_from
, edge_to
as pandas.Series
and edge_weights
as pandas.DataFrame
. pyrosm
has adaptors to igraph, networkx, pandana, in pyrosm.pyrosm.OSM.to_graph
@tomalrussell awesome!
... though I'm not familiar with spaghetti internals
The spaghetti
internals are based on the archaic pysal.network
. A complete overhaul is what we have been tiptoeing around for more than a year now.
A top priority for me (and other PySAL devs) is integration with current (segregation
-@knaaptime) and proposed (momepy
-@martinfleis) PySAL submodules that would benefit from a unified network representation, which spaghetti
could built off of.
The (gdf_nodes, gdf_edges) tuple or small wrapper object seems like it's the de-facto standard for spatial networks with explicit geometries (and optional attributes) for nodes and edges
I don't necessarily agree with this statement. Probably the most common is the OSMnx graph, which is a networkx.MultiGraph with shapely geometries attached to individual nodes and edges in the graph object. momepy is using exactly the same model, just a bit more relaxed. Both can return (gdf_nodes, gdf_edges)
but only because that is the most natural way how to export Graph to GeoDataFrame(s). None of them uses it to represent the Graph itself and do the computation. To my knowledge, pyrosm does the same thing in osm.to_graph(nodes, edges, graph_type="networkx")
.
When creating a representation of a spatial graph, we are doing so because we want to exploit graph theory in our analysis. I think that to maximise the potential, we should have GeoDataFrames on one side of the equation and a pure Graph object on the other. Trying to combine both, finding a middle ground where you have Graph but also vectorized GeometryArray would lead to unforeseen limitations I am afraid.
We need to make sure that we can efficiently convert from GeoDataFrames to networkx, igraph or cuGraph and back and leverage the power of geopandas on one side and the power of graph libraries on the other. Creating yet another representation that would lead to the sequence GeoDataFrame <conversion> spatial graph <conversion> proper graph
isn't a direction I'd like to explore.
The main conceptual issue I can see is the different logic of DataFrame and Graph objects. While the former is designed to be an array of objects and the most efficient ways of using it are vectorized array-based operations, Graph is different and in principle atomized to individual nodes and edges. That is why merging both to a single spatial graph object needs to inherently limit either geometry-based operations or graph-based operations. And I don't think we want to embed such a limitation into the very core design of it.
Both can return (gdf_nodes, gdf_edges) but only because that is the most natural way how to export Graph to GeoDataFrame(s). None of them uses it to represent the Graph itself and do the computation.
Yes, I guess I was thinking more about exchange than graph computation - dataframes work well for the edge-list representation of a graph, which networkx, igraph, cuGraph and others can accept or produce.
I think it's reasonable to imagine an object that holds a reference to both a graph (maybe an igraph or a networkx graph) and a spatial index (maybe a pygeos STRTree or a scipy kdtree) over the graph geometries. Pandana does something like this with its Network
with a kdtree and delegating down to cython/C++ for the graph implementation in cyaccess
.
On the other hand, OSMnx builds indexes on demand in osmnx.distance
. NetworkX does similarly internally, like building the graph adjacency matrix when an algorithm needs to do some linear algebra.
The core trade-off might be around deciding what internal data structures to maintain, and what to build only on demand - that's probably driven by what operations the library needs to support and how it's used. Is this list any use as a start? -
I'm still a bit unsure about scope here (trying to add useful things to the conversation without having much background 🙂 ) - for example, the refactor could be more incremental if it was at first limited to replacing use of pysal.cg
objects with shapely
geometries, then perhaps reworking parts of spaghetti.Network
to lean on networkx.MultiDiGraph
or similar.
trying to add useful things to the conversation without having much background
That is why I wanted to have a call. I also don't know much about spaghetti, I've never used it in production (sorry @jGaboardi 😄).
That is why I wanted to have a call
Yeah, I have been super busy with work, etc. I will try to set something up soon-ish. I really want to keep the ball rolling on this and not let it stop again.
As a first step to this I am thinking that simply printing out a deprecation warning on initialization that basically states libpysal
geometries will no longer be accepted as inputs starting in v2.0.0
. Next, I'll get to work on refactoring all the guts that rely on libpysal
geometry objects, which is almost everything. Once this is all worked out (which will take some time) I will release a v2.0.0
and continue with maintenance while a schema for total refactoring is being decided on. This total refactoring will be released as v3.0.0
.
@martinfleis As an overall strategy (especially the first step), would you agree this an OK plan? If not, any advice on how else to proceed?
@jGaboardi That sounds good to me.
Next, I'll get to work on refactoring all the guts that rely on libpysal geometry objects
On that we should have a chat before you start drafting it.
Moving forward we need to consider phasing out all functionality that requires (and even uses) native PySAL geometries to improve efficiency and interoperability within the ecosystem. These were necessary in the past, but are now a bottleneck for performance.