Open benmaier opened 4 years ago
@benmaier sorry for letting this sit so long. I really like this idea - a 100x speedup would be an excellent improvement. If you are still interested/have time, it'd be great if you could open a PR. There are other examples like this where there are multiple implementations, e.g. pagerank, which has a fast scipy-based implementation and a slow pure-Python implementation.
Your idea of adding a new function e.g. nx.clustering_scipy seems fine - it's very likely that we'd reorganize the API a bit similar to what was done for pagerank so that the most performant implementation is selected first (if scipy is installed), but these are all details that we could work out in a PR. The important thing would be getting the performant implementation into a PR so we could review it, connect it to the test suite, etc.
Problem
We're computing the local clustering coefficient of many directed, weighted networks of approximate size N = 400 nodes. The networks are rather dense, with ca. 50% of all possible node pairs having a non-zero weight. Computing the clustering coefficients using networkx.clustering(G) is rather slow (on the order of minutes).
Identification
I'm reasonably certain that the repeated joining of neighbor sets is to blame (see e.g. https://github.com/networkx/networkx/blob/master/networkx/algorithms/cluster.py#L166). I don't have time to specifically test this hypothesis.
Solution
A simple implementation with sparse matrix linear algebra yields the desired results much faster (see script attached). The output of the attached script:
The implementation is based on Eq. (10) of https://arxiv.org/abs/physics/0612169 (as cited in the networkx-documentation) and should be valid for (undirected, unweighted), (directed, unweighted), (undirected, weighted), and (directed, weighted) networks. I have only tested it for directed, weighted networks though.
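For readers without the attachment, a minimal sketch of what such a sparse-matrix computation could look like for the directed, weighted case follows. It is not the attached script. The function name directed_weighted_clustering is made up for illustration, and the sketch assumes positive weights, a square weight matrix with W[i, j] as the weight of edge i -> j, and weights normalized by the maximum weight, as the networkx documentation describes.

```python
import numpy as np
import scipy.sparse as sp


def directed_weighted_clustering(W):
    """Local clustering coefficients of a directed, weighted graph.

    W[i, j] is the (positive) weight of edge i -> j.
    """
    W = sp.csr_matrix(W, dtype=float)
    n = W.shape[0]
    # Drop self-loops and any explicitly stored zeros.
    W = sp.csr_matrix(W - sp.diags(W.diagonal()))
    W.eliminate_zeros()
    if W.nnz == 0:
        return np.zeros(n)

    # Binary adjacency matrix with the same sparsity pattern as W.
    A = W.copy()
    A.data[:] = 1.0

    # Normalize by the maximum weight, then take element-wise cube roots.
    W_hat = W.copy()
    W_hat.data = np.cbrt(W_hat.data / W_hat.data.max())

    # Numerator: diagonal of S^3 with S = W_hat + W_hat^T (S is symmetric),
    # computed as row sums of (S @ S) * S to avoid forming S^3 in full.
    S = (W_hat + W_hat.T).tocsr()
    numerator = np.asarray((S @ S).multiply(S).sum(axis=1)).ravel()

    # Total degree (in + out) and number of bidirectional links per node.
    d_tot = np.asarray(A.sum(axis=0)).ravel() + np.asarray(A.sum(axis=1)).ravel()
    d_bi = np.asarray(A.multiply(A.T).sum(axis=1)).ravel()

    denominator = 2.0 * (d_tot * (d_tot - 1.0) - 2.0 * d_bi)
    # Nodes with a zero denominator get a coefficient of 0.
    return np.divide(numerator, denominator, out=np.zeros(n), where=denominator > 0)
```

A natural sanity check is to compare the result against nx.clustering(G, weight="weight") for the same graph, with the matrix rows ordered like the nodes of G.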
Proceeding
I'm open to submitting a pull request containing the attached code, maybe in the style of an extra function nx.clustering_with_scipy, if this is desired. I'm open to discussion regarding this.
I'm aware of the following problems regarding the function as it's currently written:
scipy.sparse might not comply with the networkx philosophy.
Example Script