phase out libpysal geometries

jGaboardi commented 3 years ago

Moving forward we need to consider phasing out all functionality that requires (and even uses) native PySAL geometries to improve efficiency and interoperability within the ecosystem. These were necessary in the past, but are now a bottleneck for performance.

martinfleis commented 3 years ago

I am all in favour of this. I would even say we should deprecate libpysal.cg across the ecosystem.

Do you have an idea of a timeline for this? It may make sense to schedule a call and discuss overlaps with momepy. Since we'll be likely refactoring both in some way, it makes sense to do it once and in a super compatible way.

jGaboardi commented 3 years ago

I started updating the Refactor spaghetti document again. Take a look and let me know how you're thinking about it.

Maybe we can also loop in @tomalrussell and @HTenkanen? Anyone else you can think of?

tomalrussell commented 3 years ago

Hi @jGaboardi - I'd be happy to be looped in to discussions, though I'm not familiar with spaghetti internals.

The (gdf_nodes, gdf_edges) tuple or small wrapper object seems like it's the de-facto standard for spatial networks with explicit geometries (and optional attributes) for nodes and edges. You suggest it in your note, snkit.network.Network is a minimal wrapper, and pyrosm.OSM.get_network returns a tuple of gdfs.

I'm not sure if this is in scope for this issue, but there are various adaptors to graph/network representations:

momepy handles gdf<>nx conversion, including making a graph directly from LineStrings
osmnx works with networkx.MultiDiGraph and has helpers to/from gdfs osmnx.utils_graph.graph_from_gdfs
pandana has pandana.network.Network without explicit edge geometries or other node/edge attributes, but accepts node_x, node_y, edge_from, edge_to as pandas.Series and edge_weights as pandas.DataFrame.
pyrosm has adaptors to igraph, networkx, pandana, in pyrosm.pyrosm.OSM.to_graph

jGaboardi commented 3 years ago

@tomalrussell awesome!

... though I'm not familiar with spaghetti internals

The spaghetti internals are based on the archaic pysal.network. A complete overhaul is what we have been tiptoeing around for more than a year now.

A top priority for me (and other PySAL devs) is integration with current (segregation-@knaaptime) and proposed (momepy-@martinfleis) PySAL submodules that would benefit from a unified network representation, which spaghetti could built off of.

martinfleis commented 3 years ago

The (gdf_nodes, gdf_edges) tuple or small wrapper object seems like it's the de-facto standard for spatial networks with explicit geometries (and optional attributes) for nodes and edges

I don't necessarily agree with this statement. Probably the most common is the OSMnx graph, which is a networkx.MultiGraph with shapely geometries attached to individual nodes and edges in the graph object. momepy is using exactly the same model, just a bit more relaxed. Both can return (gdf_nodes, gdf_edges) but only because that is the most natural way how to export Graph to GeoDataFrame(s). None of them uses it to represent the Graph itself and do the computation. To my knowledge, pyrosm does the same thing in osm.to_graph(nodes, edges, graph_type="networkx").

When creating a representation of a spatial graph, we are doing so because we want to exploit graph theory in our analysis. I think that to maximise the potential, we should have GeoDataFrames on one side of the equation and a pure Graph object on the other. Trying to combine both, finding a middle ground where you have Graph but also vectorized GeometryArray would lead to unforeseen limitations I am afraid.

We need to make sure that we can efficiently convert from GeoDataFrames to networkx, igraph or cuGraph and back and leverage the power of geopandas on one side and the power of graph libraries on the other. Creating yet another representation that would lead to the sequence GeoDataFrame <conversion> spatial graph <conversion> proper graph isn't a direction I'd like to explore.

The main conceptual issue I can see is the different logic of DataFrame and Graph objects. While the former is designed to be an array of objects and the most efficient ways of using it are vectorized array-based operations, Graph is different and in principle atomized to individual nodes and edges. That is why merging both to a single spatial graph object needs to inherently limit either geometry-based operations or graph-based operations. And I don't think we want to embed such a limitation into the very core design of it.

tomalrussell commented 3 years ago

Both can return (gdf_nodes, gdf_edges) but only because that is the most natural way how to export Graph to GeoDataFrame(s). None of them uses it to represent the Graph itself and do the computation.

Yes, I guess I was thinking more about exchange than graph computation - dataframes work well for the edge-list representation of a graph, which networkx, igraph, cuGraph and others can accept or produce.

I think it's reasonable to imagine an object that holds a reference to both a graph (maybe an igraph or a networkx graph) and a spatial index (maybe a pygeos STRTree or a scipy kdtree) over the graph geometries. Pandana does something like this with its Network with a kdtree and delegating down to cython/C++ for the graph implementation in cyaccess.

On the other hand, OSMnx builds indexes on demand in osmnx.distance. NetworkX does similarly internally, like building the graph adjacency matrix when an algorithm needs to do some linear algebra.

The core trade-off might be around deciding what internal data structures to maintain, and what to build only on demand - that's probably driven by what operations the library needs to support and how it's used. Is this list any use as a start? -

geometry manipulation (precision, CRS, simplification, merging)
geometry queries/predicates (nearest, intersects, contains...)
graph traversal (anything where you visit neighbours - BFS, DFS, Dijkstra, components...)
creation, conversion (from/to objects, files)
mutability (add/remove nodes and edges, change attributes or geometries)

I'm still a bit unsure about scope here (trying to add useful things to the conversation without having much background 🙂 ) - for example, the refactor could be more incremental if it was at first limited to replacing use of pysal.cg objects with shapely geometries, then perhaps reworking parts of spaghetti.Network to lean on networkx.MultiDiGraph or similar.

martinfleis commented 3 years ago

trying to add useful things to the conversation without having much background

That is why I wanted to have a call. I also don't know much about spaghetti, I've never used it in production (sorry @jGaboardi 😄).

jGaboardi commented 3 years ago

That is why I wanted to have a call

Yeah, I have been super busy with work, etc. I will try to set something up soon-ish. I really want to keep the ball rolling on this and not let it stop again.

jGaboardi commented 3 years ago

As a first step to this I am thinking that simply printing out a deprecation warning on initialization that basically states libpysal geometries will no longer be accepted as inputs starting in v2.0.0. Next, I'll get to work on refactoring all the guts that rely on libpysal geometry objects, which is almost everything. Once this is all worked out (which will take some time) I will release a v2.0.0 and continue with maintenance while a schema for total refactoring is being decided on. This total refactoring will be released as v3.0.0.

@martinfleis As an overall strategy (especially the first step), would you agree this an OK plan? If not, any advice on how else to proceed?

martinfleis commented 3 years ago

@jGaboardi That sounds good to me.

Next, I'll get to work on refactoring all the guts that rely on libpysal geometry objects

On that we should have a chat before you start drafting it.

pysal / spaghetti

phase out libpysal geometries #646