igraph / python-igraph

Python interface for igraph
GNU General Public License v2.0
1.32k stars 249 forks source link

Efficient computation of fully qualified shortest paths of many-to-many vertex pairs #673

Closed janikmu closed 1 year ago

janikmu commented 1 year ago

What is the feature or improvement you would like to see? The now deprecated function shortest_paths (since 0.10.0) allowed for output of fully qualified paths, either a path of vertices or edges, not just path lengths for multiple source and target pairs. The replacement, distances lacks that (which is fair, as it has another use case). I would love to see this functionality implemented back in again, maybe in a separate function.

Use cases for the feature I am quite sure that computing multiple shortest path pairs at the same time must allow for some performance gain compared to computing shortest path for each pair individually (especially if there is a lot of overlap in the paths). The functions that remain either compute all shortest paths, which is an overhead for my use case, or shortest paths from a single source node to multiple other target vertices (get_shortest_paths), which is needs some pre- and post-processing for my use case and is likely slower.

References I am unsure how this might be related to https://github.com/igraph/python-igraph/issues/470

szhorvat commented 1 year ago

The now deprecated function shortest_paths (since 0.10.0) allowed for output of fully qualified paths, either a path of vertices or edges, not just path lengths for multiple source and target pairs.

This is not correct. shortest_paths() only computed distances, not paths. This is why it was simply renamed to distances() without change of functionality.

The functions that remain either compute all shortest paths, which is an overhead for my use case, or shortest paths from a single source node to multiple other target vertices (get_shortest_paths), which is needs some pre- and post-processing for my use case and is likely slower.

This is also not correct. get_shortest_paths() has a v parameter for sources and a to parameter for targets. Provide a single source and single target to get at most one shortest path between them.


In other words, no functionality was lost with this deprecation.

janikmu commented 1 year ago

This is not correct. shortest_paths() only computed distances, not paths.

Hm, then I misinterpreted the documentation of the now deprecated function back in the day, making this issue a completely new feature request indeed.

This is also not correct. get_shortest_paths() has a v parameter for sources and a to parameter for targets. Provide a single source and single target to get at most one shortest path between them.

Yes, but it does not allow for multiple v parameters, which my feature request is about: Computation of many vertex-pairs shortest paths (any) at once.

szhorvat commented 1 year ago

Yes, but it does not allow for multiple v parameters, which my feature request is about: Computation of many vertex-pairs shortest paths (any) at once.

It seems I misinterpreted your request. I thought you wanted a single source to a single target function that returns paths (not distances). We have this available as a special case of the single source to multiple targets function, and not much additional optimization is possible here due to how Dijkstra's algorithm works.

What you are actually requesting is a multiple sources to multiple targets function, the motivation being better performance than several calls to the single source to multiple targets function. Is this correct?

janikmu commented 1 year ago

What you are actually requesting is a multiple sources to multiple targets function, the motivation being better performance than several calls to the single source to multiple targets function. Is this correct?

Yes, exactly. I am especially thinking of cases, where some paths are sub-paths to other paths. Here, multiple separate calls to the single source to multiple targets function might not fully take advantage of that.

szhorvat commented 1 year ago

Let's think about this idea a bit more carefully. There are two distinct types of optimization opportunities in a many-to-many shortest path function, compared to multiple calls to a one-to-many function: technical and algorithmic.

Technical:

igraph currently converts the graph into an adjacency list representation before computing shortest paths. This conversion happens on each call to the one-to-many function. A many-to-many function could do it only once. I benchmarked this before, and it's clear that there were significant performance benefits when computing distances only. It's less clear how much of an improvement we'd get when returning the actual paths. This should be benchmarked first. If you have time to help with this @janikmu, that would be appreciated. Note that it must be done in C, not Python.

Algorithmic:

You are hoping for optimization opportunities exploiting the overlap between shortest paths. This is an interesting, but far from trivial problem. Are you aware of any existing many-to-many algorithms that exploit overlaps, and if yes, can you post some references?

The best-known weighted shortest path algorithms are either one-to-all (Dijkstra, Bellman-Ford) or all-to-all (Floyd-Warshall). In the majority of practical cases, running Dikjstra for all sources outperforms Floyd-Warshall even for all-to-all computations (it's only faster for very dense graphs). Note that there aren't any efficient one-to-one algorithms. Thus let's think in terms of the one-to-all ones.

One-to-all algorithms already effectively exploit the overlaps by computing a shortest path tree. Is it possible to somehow extend this to a many-to-all, or all-to-all case? In a shortest path tree, we get not only the paths from the root to all other vertices, but from any vertex v to all descendants of that vertex within the tree. But what if we need shortest paths from v to a target that is not its descendant within this tree? Then we can run the one-to-all algorithm again, now starting from v. Is it possible to make use of the previously computed tree to speed up this second run?

This is not an easy question. I would say it's a research-level problem. If you are aware of any studies on it, please let us know @janikmu. If you could do a literature search on the topic, that would take us halfway there. As usual with graph algorithms, the most time consuming part is not the coding, but reading up on and understanding the theory.

Technical issues:

We do not have a data structure that can represent a matrix of paths (as opposed to a matrix of scalar values). This is of course solvable, but it would be a major addition to the C core, so it's good to point it out here.

szhorvat commented 1 year ago

This is an interesting concept, but it seems to exploit specific properties of road networks, so it may not be suitable for generic networks: https://en.wikipedia.org/wiki/Contraction_hierarchies

The speed-up is achieved by creating shortcuts in a preprocessing phase which are then used during a shortest-path query to skip over "unimportant" vertices.[2] This is based on the observation that road networks are highly hierarchical. Some intersections, for example highway junctions, are "more important" and higher up in the hierarchy than for example a junction leading into a dead end. Shortcuts can be used to save the precomputed distance between two important junctions such that the algorithm doesn't have to consider the full path between these junctions at query time.

janikmu commented 1 year ago

Funny that you point out the road network specific concept for exploiting connection importance. This is exactly where my personal need stems from: Transportation in a road network with some specific edge weights. However, of course, this would need to be implemented for all types of graphs, not just certain types, indeed complicating the problem.

igraph currently converts the graph into an adjacency list representation before computing shortest paths. This conversion happens on each call to the one-to-many function. A many-to-many function could do it only once. I benchmarked this before, and it's clear that there were significant performance benefits when computing distances only. It's less clear how much of an improvement we'd get when returning the actual paths. This should be benchmarked first. If you have time to help with this @janikmu, that would be appreciated. Note that it must be done in C, not Python.

Unfortunately, I have zero experience with C at this point, and don't have the time to change that short-term.

This is not an easy question. I would say it's a research-level problem. [...] As usual with graph algorithms, the most time consuming part is not the coding, but reading up on and understanding the theory.

Your intuition of exploiting one-to-all trees is really similar to what I had in mind. I will see what I can do reading up on the algorithmic side, but I just realised that this is indeed an uncommon problem. Regarding the representation of such structures: This is also why I linked #470, as I think this would profit from similar under-the-hood data structures.

szhorvat commented 1 year ago

I will see what I can do reading up on the algorithmic side

Thanks!

This is exactly where my personal need stems from: Transportation in a road network with some specific edge weights.

Can you describe your specific use case in more detail? It is always helpful to know how people use igraph in practice. How big a network do you have, how many from and to vertices do you need to process in a single query, do you need to perform multiple queries on the very same graph, etc. ?


I'll open a separate feature request once we have a clearer picture of what could be implemented.