networkx / networkx

Network Analysis in Python
https://networkx.org

Computation of the clustering coefficient is rather slow #3997

Open benmaier opened 4 years ago

benmaier commented 4 years ago

Problem

We're computing the local clustering coefficient of many directed, weighted networks with approximately N = 400 nodes each. The networks are rather dense, with ca. 50% of all possible node pairs having a non-zero weight. Computing the clustering coefficients via networkx.clustering(G) is rather slow (on the order of minutes).

Identification

I'm reasonably certain that the repeated joining of neighbor sets is to blame (see e.g. https://github.com/networkx/networkx/blob/master/networkx/algorithms/cluster.py#L166). I don't have time to specifically test this hypothesis.
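
For illustration, the pure-Python path boils down to a per-node loop of roughly the following shape (a simplified sketch of the cost pattern only, not the actual cluster.py code): for every node, a union of its predecessor and successor sets is built, and for every neighbor another such union is built and intersected. On a dense graph this means on the order of N·d² Python-level set operations.

import networkx as nx

def neighbor_union_loop(G):
    # illustrative sketch: count, per node, the closed 2-paths found via
    # repeated neighbor-set unions and intersections
    counts = {}
    for i in G:
        nbrs_i = (set(G.predecessors(i)) | set(G.successors(i))) - {i}
        t = 0
        for j in nbrs_i:
            # a fresh set union is constructed for every neighbor j of every node i
            nbrs_j = (set(G.predecessors(j)) | set(G.successors(j))) - {j}
            t += len(nbrs_i & nbrs_j)
        counts[i] = t
    return counts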

Solution

A simple implementation with sparse matrix linear algebra yields the desired results much faster (see script attached). The output of the attached script:

computing with sparse matrices
... needed 0.5278730392456055 seconds
computing with pure networkx
... needed 328.80611991882324 seconds
average deviation of results = 0.0

The implementation is based on Eq. (10) of https://arxiv.org/abs/physics/0612169 (as cited in the networkx-documentation) and should be valid for (undirected, unweighted), (directed, unweighted), (undirected, weighted), and (directed, weighted) networks. I have only tested it for directed, weighted networks though.
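
For reference, this is my transcription of the quantity the script computes (hats denote the entrywise cube root of the weights after normalization by the maximum weight, d_i^tot is the total degree, and d_i^<-> counts reciprocated neighbors):

\tilde{C}_i = \frac{\big[(\hat{W} + \hat{W}^{\top})^3\big]_{ii}}{2\,\big[d_i^{\mathrm{tot}}\,(d_i^{\mathrm{tot}} - 1) - 2\, d_i^{\leftrightarrow}\big]},
\qquad
\hat{W} = \Big(\tfrac{w_{ij}}{\max_{kl} w_{kl}}\Big)^{1/3},
\qquad
d_i^{\mathrm{tot}} = d_i^{\mathrm{in}} + d_i^{\mathrm{out}},
\qquad
d_i^{\leftrightarrow} = (A^2)_{ii}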

Proceeding

I'm open to submitting a pull request containing the attached code, maybe in the form of an extra function nx.clustering_with_scipy, if this is desired. I'm open to discussion on this.

I'm aware of the following problems with the function as it's currently written:

  1. Limited functionality. Clustering is computed for all nodes and returned as a numpy array instead of a dictionary (see the wrapper sketch after this list).
  2. Using scipy.sparse might not comply with the networkx philosophy.
  3. It appears to me that the philosophy of the current implementation is to save memory instead of time. The implementation below favors speed over memory.
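
Point 1 could be addressed with a thin wrapper around get_clustering (the function from the attached script) that mimics the nx.clustering signature. A minimal sketch, assuming nodes may be None, a single node, or an iterable of nodes:

def get_clustering_dict(G, nodes=None, weight='weight'):
    # wrap the array-returning implementation so it returns a dict keyed by node,
    # like nx.clustering does
    C = get_clustering(G, weight=weight)
    clustering = {n: float(c) for n, c in zip(G.nodes(), C)}
    if nodes is None:
        return clustering
    if nodes in G:  # a single node was passed
        return clustering[nodes]
    return {n: clustering[n] for n in nodes}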

Example Script

import networkx as nx
import numpy as np
import scipy.sparse as sprs

def get_clustering(G, weight='weight'):

    N = G.number_of_nodes()

    # get binary adjacency matrix with self-loops removed
    A = nx.to_scipy_sparse_matrix(G, weight=None, format='lil')
    A.setdiag(np.zeros((N,)))
    A = A.tocsr()

    # get weighted adjacency matrix with self-loops removed
    W = nx.to_scipy_sparse_matrix(G, weight=weight, format='lil')
    W.setdiag(np.zeros((N,)))
    W = W.tocsr()
    W = W.asfptype() # cast to floating-point matrix

    # compute total degree (in-degree + out-degree) and symmetric degree 
    # (number of neighbors where both in- and out-edge exist)
    dtot = np.array(A.sum(axis=0)).flatten() + np.array(A.sum(axis=1)).flatten()
    dsym = (A.dot(A)).diagonal()

    # normalize weights, prepare for geometric mean, and build symmetrized weight matrix
    Wmax = W.data.max()
    W.data /= Wmax
    W.data = W.data**(1/3)
    W = W + W.T

    # compute cumulated triangle intensity and number of possible triangles
    numerator = 0.5*(W.dot(W).dot(W)).diagonal()
    denominator = dtot * (dtot-1) - 2*dsym

    # set clustering coefficient to zero for nodes where no triangles 
    # can exist
    ndx = np.where(denominator<=0.0)[0]
    numerator[ndx] = 0
    denominator[ndx] = 1

    return numerator / denominator

if __name__=="__main__":

    from time import time

    N = 400

    # set seed and create weighted matrix where 50 percent of edges exist
    np.random.seed(9883245)
    W = np.exp(np.random.randn(N,N))

    W[W<=np.median(W)] = 0

    # convert to DiGraph object
    G = nx.from_numpy_array(W,create_using=nx.DiGraph)

    print("computing with sparse matrices")
    start = time()
    C_sprs = get_clustering(G)
    end = time()
    print("... needed", end-start, "seconds")

    print("computing with pure networkx")
    start = time()
    C_nx = nx.clustering(G,weight='weight')
    end = time()
    print("... needed", end-start, "seconds")
    C_nx = np.array(list(C_nx.values()))

    print("average derivation of results =", np.abs(1-np.mean(C_sprs/C_nx)))
rossbar commented 2 years ago

@benmaier sorry for letting this sit so long. I really like this idea - a 100x speedup would be an excellent improvement. If you are still interested/have time, it'd be great if you could open a PR. There are other examples like this where there are multiple implementations, e.g. pagerank, which has a fast scipy-based implementation and a slow pure-Python implementation.

Your idea of adding a new function (e.g. nx.clustering_scipy) seems fine - it's very likely that we'd reorganize the API a bit, similar to what was done for pagerank, so that the most performant implementation is selected first (if scipy is installed), but these are all details that we could work out in a PR. The important thing would be getting the performant implementation into a PR so we could review it, connect it to the test suite, etc.
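
Roughly, that dispatch pattern looks like the following sketch (placeholder names, not actual networkx internals): a thin top-level function that prefers the scipy-backed implementation when scipy is importable and falls back to the pure-Python one otherwise.

def clustering(G, nodes=None, weight=None):
    # hypothetical top-level entry point; _clustering_scipy and
    # _clustering_python are placeholders for the two implementations
    try:
        import scipy.sparse  # noqa: F401
    except ImportError:
        return _clustering_python(G, nodes=nodes, weight=weight)
    return _clustering_scipy(G, nodes=nodes, weight=weight)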