networkx / networkx

Network Analysis in Python
https://networkx.org

Random graph optimization opportunity #1458

Closed. jg-you closed this issue 9 years ago

jg-you commented 9 years ago

TL;DR

I've been looking into implementing the stochastic block model (it will come in a later pull request), which eventually led me to the source code of the G_{n,p} network generator and its fast counterpart. After running some tests, it appears that there is potential for optimization.

The regular algorithm iterates over the N(N-1)/2 potential edges and tests each one for existence with probability p. In principle it is therefore of complexity O(N^2), but it has a strong dependence on p because no memory operation is required when a link tests "negative". The fast algorithm is of complexity O(N + M), with M the expected number of links: it samples the gap between successive positive tests from a geometric distribution and uses some index tricks to map those gaps back to node pairs (sketched below).
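
For context, here is a minimal sketch of that geometric gap-sampling idea. The helper name and structure are mine for illustration; this is not the actual networkx source.

import math
import random

def sample_gnp_edges_geometric(n, p):
    """Yield G(n, p) edges without testing every node pair (sketch only).

    Instead of one Bernoulli trial per pair, draw the gap to the next
    successful trial from a geometric distribution and skip ahead.
    """
    assert 0 < p < 1, "sketch assumes a non-degenerate probability"
    lp = math.log(1.0 - p)
    v, w = 1, -1
    while v < n:
        # Number of pair slots to skip before the next edge (geometric draw).
        w += 1 + int(math.log(1.0 - random.random()) / lp)
        # Map the flat pair index back onto a pair (v, w) with w < v.
        while w >= v and v < n:
            w -= v
            v += 1
        if v < n:
            yield (v, w)

Wrapped as G.add_edges_from(sample_gnp_edges_geometric(n, p)) on an empty graph, this gives the O(N + M) behaviour described above.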

For any given p, the number of memory operations is -- by definition -- the same on average, which allows us to factor out the memory cost of both methods. Choosing the more efficient method ("fast" vs. regular) is then only a matter of testing whether it is faster to walk a list of ~N^2 entries by geometric skips or entry by entry. There is no doubt that the geometric method is faster in extremely sparse cases, but there must be a p above which the overhead of the geometric sampler outweighs its benefits.

The outcome is bound to be influenced by the environment (e.g. Python version, hardware), but it should be possible to come up with a rule of thumb that could be implemented directly in networkx. More precisely, for all p below the threshold, gnp_random_graph could call fast_gnp_random_graph under the hood, and vice versa.

Based on my tests, I believe this threshold should be p = 0.2. Because the transition point depends on the environment in non-trivial ways it should be overridable, but it could still be provided as a default; see the sketch below.
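
A minimal sketch of what such a rule of thumb could look like. The wrapper name gnp_auto is hypothetical (not an existing networkx function), and the default threshold is only the value suggested by the measurements below; treat it as environment-dependent.

import networkx as nx

def gnp_auto(n, p, threshold=0.2, seed=None):
    """Hypothetical dispatcher: pick the G(n, p) generator based on density."""
    if p < threshold:
        # Sparse regime: the geometric ("fast") sampler should win.
        return nx.fast_gnp_random_graph(n, p, seed=seed)
    # Denser regime: the plain systematic generator is competitive or faster.
    return nx.gnp_random_graph(n, p, seed=seed)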

Test code

from __future__ import division
import numpy as np
import networkx as nx
import time
import sys
import math

def mean(l):
    return sum(l)/len(l)

def stdev(l, avg):
    return math.sqrt(sum([ (x - avg) ** 2 for x in l])/(len(l)-1))

def main():
    current_milli_time = lambda: int(round(time.time() * 1000))  # wall-clock time in milliseconds
    N = int(sys.argv[1])
    iterations = int(sys.argv[2])
    for p in np.arange(0.005, 0.30, 0.005):
        simple_times = []
        fast_times = []
        for i in range(0, iterations):
            start_t = current_milli_time()
            G = nx.gnp_random_graph(N, p)
            simple_times.append(current_milli_time() - start_t)
            G.clear()

            start_t = current_milli_time()
            G = nx.fast_gnp_random_graph(N, p)
            fast_times.append(current_milli_time() - start_t)
            G.clear()

        m1 = mean(simple_times)
        m2 = mean(fast_times)
        print("{0:.3f}".format(p) + "\t" +
              "{0:.8f}".format(m1) + "\t" +
              "{0:.8f}".format(stdev(simple_times, m1)) +"\t" +
              "{0:.8f}".format(m2) + "\t" +
              "{0:.8f}".format(stdev(fast_times ,m2)))

if __name__ == '__main__':
    main()
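
For reference, the script takes the graph size N and the number of iterations per p as command-line arguments; assuming it is saved as gnp_benchmark.py (hypothetical name), a run looks like

python gnp_benchmark.py 400 100

and each output row lists p, then mean and stdev for gnp_random_graph, then mean and stdev for fast_gnp_random_graph.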

Results

Changing the graph size

[Figure: Python 3.4, different graph sizes]

The most important figure of all: a comparison of the two methods for various graph sizes.

 N      |  # of potential pairs (~N^2)
100     |  ~10 000
200     |  ~40 000
400     |  ~160 000
There is a small dependency on N: as N grows, the optimal p decreases ever so slightly. It might be noise, it might be a real effect. Nonetheless, p = 0.20 seems like a good rule of thumb.

Time dispersion of both methods

[Figure: Python 3.4, standard deviation of running times]

The time dispersion is similar for both methods, i.e. running times are concentrated tightly around the average.

Environment's effect

Python 2.7.5 vs Python 3.4.2

The code runs slightly faster on python3.4, but the speed-up is much more pronounced for the normal method, which relies on plain iterators.

[Figure: Python 2.7.5, different hosts]

The hosts affect the time (obviously), and the transition point :/. All tests are performed on the fastest host (host B).

Python 3.4.2, optimization flag

The -O flag doesn't change anything, as expected.

chebee7i commented 9 years ago

I haven't had time to read through this in detail, but I know the historical trend has been that we haven't used a lot of magic methods (where a particular algorithm is automagically chosen for you). So my guess is that we'd probably have to add a new function that could serve as the new public API rather than modify either of the two existing ones.

jg-you commented 9 years ago

Perhaps gnp_random_graph could serve as an overridable API? I.e., it selects automatically between the geometric and systematic methods, but the systematic method can be enforced by setting an auto flag to False; fast_gnp_random_graph, on the other hand, would be guaranteed to use the geometric sampling method (i.e. it remains the same).

chebee7i commented 9 years ago

Is there some helpful name that could be reapplied to each? This is a bad suggestion, but it is along the lines I was thinking of:

# Exposed at top level.
# If user wants a particular method, then they must call it explicitly.
gnp_random_graph # automagic

# Exposed only at nx.generators
gnp_random_graph_systematic
gnp_random_graph_geometric

I understand your suggestion... the asymmetry bothers me for some OCD reason. Haha.

chebee7i commented 9 years ago

Also, I just realized the scale on the y-axis: ms. While the difference clearly exists, is a few milliseconds worth worrying about? I'm just trying to understand where this becomes very helpful. Is it for values of p much larger than .25?

ysitu commented 9 years ago

@chebee7i Would love to see cases where N has six or seven digits.

jg-you commented 9 years ago

@chebee7i That's also a good suggestion. I was just worried about breaking API contracts. Does networkx maintain a stable API? Another alternative would be

gnp_random_graph(N,p)  # automagic
fast_gnp_random_graph(N,p)  # automagic
gnp_random_graph(N,p,auto=False)  # enforces the systematic algorithm
fast_gnp_random_graph(N,p,auto=False)  # enforces the geometric algorithm

or the inverse (automagic must be explicitly set). I prefer the former, since users who are aware of the difference (have read the docs, etc.) can probably select the best generator anyway.

@ysitu There is no doubt that there is a huge difference for large N. At N=400 we're already in the couple of seconds territory.

jg-you commented 9 years ago

Quite frankly, the only big issue I see with this proposal is that the transition point varies from one environment to another (and is likely to shift as Python gets more optimized).

ysitu commented 9 years ago

@jg-you I thought @chebee7i was saying the measured times were a few milliseconds?

chebee7i commented 9 years ago

Unless I am misreading the top graph, for N=400, the difference between the two algorithms doesn't look to be more than 2-3 milliseconds. Is the y-axis mislabeled?

jg-you commented 9 years ago

Oops my bad. It's indeed 2-3 ms -- thank god for proper labels!

You are right, it's a very small optimization opportunity. If someone needs truly large graphs, they'll probably opt for something other than networkx anyway; it tends to get sluggish once you enter multimillion-node territory D:

jg-you commented 9 years ago

@ysitu This is what happens at "large" sizes (N = 10 000). Differences range from 1 s to ~10 s if we extrapolate to large p.

[Figure: large graphs, N = 10 000]

But again, it goes to show that the crossing point is not predictable. Maybe add a note in the documentation (e.g. here and here) that hints at what "small enough" means.

bjedwards commented 9 years ago

These results are interesting. I am a little blown away that the transition point doesn't seem to depend on the size of the graph, the version of Python, or the host. One thing I am curious about: in the first figure it seems like gnp_random_graph scales roughly linearly in p, which I guess we should expect given that adding an edge probably takes longer than not adding one. fast_gnp_random_graph does not, which is strange because I feel like it should scale linearly with p. Does fast_gnp_random_graph catch up for larger values of p?

Can you give a few more details? What are the two hosts: OS, Architecture, etc...

jg-you commented 9 years ago

@bjedwards But it does depend on the host, version, and size! It sits in the range [0.1,0.2] for all tested cases, but that's as precise as it gets.

The first figure is somewhat misleading: it's on a semi-log axis to fit all the curves on the same graph. It actually shows that gnp_random_graph is not linear at small p, i.e. small m. fast_gnp_random_graph is linear in p, as expected.

As for the host details

jgyou@HostA:~$> uname -a 
Linux HostA 3.2.0-74-generic #109-Ubuntu SMP Tue Dec 9 16:45:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
jgyou@HostA:jgyou$> cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"
NAME="Ubuntu"
VERSION="12.04.5 LTS, Precise Pangolin"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu precise (12.04.5 LTS)"
VERSION_ID="12.04"
jgyou@HostB:~$> uname -a 
Linux HostB 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 UTC 2015         
jgyou@HostB:~$> cat /etc/*-release
CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)
hagberg commented 9 years ago

Algorithmically, gnp_random_graph should scale as O(n^2), where n is the number of nodes, and fast_gnp_random_graph should scale as O(p n^2), i.e. with the expected number of edges. But it looks like, in practice, adding edges to the graph costs something, so gnp_random_graph also has some dependence on the number of edges.

bjedwards commented 9 years ago

@jg-you I didn't notice the log scale, whoops. I see that it sits in the range [0.1, 0.2], but I am surprised this is consistent across different values (orders of magnitude, apparently) of N. Before seeing this I would have assumed that the switch point would vary depending on N.

@hagberg I think that gnp_random_graph would actually scale with p as well.

In the function:

...
for e in edges:
    # One Bernoulli trial per candidate edge; add_edge is only called on a
    # success, so the loop body gets more expensive as p grows.
    if random.random() < p:
        G.add_edge(*e)
return G

I am assuming the call to G.add_edge(*e) takes longer than not calling it, so larger p values mean more calls and longer running times.
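
A quick sketch to see this effect (not the actual generator code; it reimplements the loop structure above as a micro-benchmark). If add_edge were free, both runs below would take roughly the same time; in practice the denser run should be noticeably slower.

import random
import timeit
from itertools import combinations

import networkx as nx

def edge_loop(n, p):
    # Same shape as the loop in gnp_random_graph: one Bernoulli trial per
    # candidate edge, add_edge only on success.
    G = nx.empty_graph(n)
    for e in combinations(range(n), 2):
        if random.random() < p:
            G.add_edge(*e)
    return G

n = 400
for p in (0.01, 0.5):
    t = timeit.timeit(lambda: edge_loop(n, p), number=5)
    print("p = {:.2f}: {:.3f} s for 5 runs".format(p, t))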

jg-you commented 9 years ago

Closing the issue. The optimization window is too variable, and too heavily influenced by the environment, to justify unified algorithm selection.