I haven't had time to read through this in detail, but I know the historical trend has been that we haven't used a lot of magic methods (where a particular algorithm is automagically chosen for you). So my guess is that we'd probably have to add a new function to serve as the new public API rather than modify either of the two existing ones.
Perhaps `gnp_random_graph` could serve as an overridable API? I.e., it selects automatically between the geometric and systematic methods, but the systematic method can be enforced by setting an `auto` flag to `False`. Whereas `fast_gnp_random_graph` would be guaranteed to use the geometric sampling method (i.e. it remains the same).
Is there some helpful name that could be reapplied to each? This is a bad suggestion, but it is along the lines I was thinking of:

```python
# Exposed at top level.
# If the user wants a particular method, they must call it explicitly.
gnp_random_graph             # automagic

# Exposed only at nx.generators
gnp_random_graph_systematic
gnp_random_graph_geometric
```
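For concreteness, a minimal sketch of how that layout could be wired up, assuming the two explicit variants are just thin wrappers around the existing generators (the names, the `_auto` suffix, and the 0.2 cutoff are illustrative assumptions, not existing networkx API):

```python
import networkx as nx

def gnp_random_graph_systematic(n, p, seed=None):
    # Always use the edge-by-edge (O(N^2)) test.
    return nx.gnp_random_graph(n, p, seed=seed)

def gnp_random_graph_geometric(n, p, seed=None):
    # Always use the geometric (skip-sampling) method.
    return nx.fast_gnp_random_graph(n, p, seed=seed)

def gnp_random_graph_auto(n, p, seed=None):
    # Top-level automagic entry point: pick whichever variant is expected
    # to be faster for this density (rule-of-thumb threshold only).
    if p < 0.2:
        return gnp_random_graph_geometric(n, p, seed=seed)
    return gnp_random_graph_systematic(n, p, seed=seed)
```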
I understand your suggestion... the asymmetry bothers me for some OCD reason. Haha.
Also, I just realized the scale on the y-axis: ms. While the difference clearly exists, is a few milliseconds worth worrying about? I'm just trying to understand where this becomes very helpful. Is it for values of p much larger than .25?
@chebee7i Would love to see cases where N has six or seven digits.
@chebee7i That's also a good suggestion. I was just worried about breaking API contracts. Does networkx maintain a stable API? Another alternative would be

```python
gnp_random_graph(N, p)                   # automagic
fast_gnp_random_graph(N, p)              # automagic
gnp_random_graph(N, p, auto=False)       # enforces the systematic algorithm
fast_gnp_random_graph(N, p, auto=False)  # enforces the geometric algorithm
```

or the inverse (automagic must be explicitly set). I prefer the former, since users who are aware of the difference (have read the docs, etc.) can probably select the best generator anyway.
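A rough sketch of how such an `auto` keyword might dispatch (shown only for `gnp_random_graph`; `fast_gnp_random_graph` would mirror it). The keyword name and the hard-coded 0.2 threshold are assumptions for illustration, not existing networkx behaviour:

```python
import networkx as nx

P_THRESHOLD = 0.2  # illustrative crossover density, not a measured constant

def gnp_random_graph(n, p, seed=None, auto=True):
    # auto=True: delegate to the geometric sampler when the graph is sparse.
    # auto=False: always run the systematic edge-by-edge test.
    if auto and p < P_THRESHOLD:
        return nx.fast_gnp_random_graph(n, p, seed=seed)
    return nx.gnp_random_graph(n, p, seed=seed)
```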
@ysitu There is no doubt that there is a huge difference for large `N`. At `N=400` we're already in couple-of-seconds territory.
Quite frankly, the only big issue I see with this proposal is the fact that the transition point varies from one environment to another (and is likely to change as Python gets more optimized).
@jg-you I thought @chebee7i was saying the measured times were a few milliseconds?
Unless I am misreading the top graph, for N=400, the difference between the two algorithms doesn't look to be more than 2-3 milliseconds. Is the y-axis mislabeled?
Oops my bad. It's indeed 2-3 ms -- thank god for proper labels!
You are right, it's a very small optimization opportunity.
If someone needs truly large graphs, they'll probably opt for something other than networkx anyway; it tends to get sluggish when one enters the multimillion-node range D:
These results are interesting. I am a little blown away that the transition point doesn't seem to depend on the size of the graph, the version of Python, or the host. One thing I am curious about: in the first figure it seems like `gnp_random_graph` is scaling roughly linearly in `p`, which I guess we should expect given that adding an edge probably takes longer than not adding an edge. `fast_gnp_random_graph` does not, which is strange because I feel like it should scale linearly with `p`. Does `fast_gnp_random_graph` catch up for larger values of `p`?
Can you give a few more details? What are the two hosts: OS, Architecture, etc...
@bjedwards But it does depend on the host, version, and size! It sits in the range [0.1, 0.2] for all tested cases, but that's as precise as it gets.

The first figure is somewhat misleading; it's on a semi-log axis to fit all the curves on the same graph. Actually, it shows that `gnp_random_graph` is not linear at small `p`, i.e. small `m`. `fast_gnp_random_graph` is linear in `p`, as expected.
As for the host details:

```
jgyou@HostA:~$> uname -a
Linux HostA 3.2.0-74-generic #109-Ubuntu SMP Tue Dec 9 16:45:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
jgyou@HostA:jgyou$> cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.5 LTS"
NAME="Ubuntu"
VERSION="12.04.5 LTS, Precise Pangolin"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu precise (12.04.5 LTS)"
VERSION_ID="12.04"
```

```
jgyou@HostB:~$> uname -a
Linux HostB 2.6.32-504.8.1.el6.x86_64 #1 SMP Wed Jan 28 21:11:36 UTC 2015
jgyou@HostB:~$> cat /etc/*-release
CentOS release 6.6 (Final)
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
CentOS release 6.6 (Final)
CentOS release 6.6 (Final)
```
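For anyone who wants to reproduce this kind of measurement, a minimal timing sketch along these lines should do (the `n`/`p` grid and the repeat count are arbitrary choices, not the script that produced the figures above):

```python
import timeit

import networkx as nx

def best_time(generator, n, p, repeat=5):
    # Best-of-`repeat` wall-clock time (in seconds) for generating one graph.
    return min(timeit.repeat(lambda: generator(n, p), number=1, repeat=repeat))

n = 400
for p in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50):
    t_regular = best_time(nx.gnp_random_graph, n, p)
    t_fast = best_time(nx.fast_gnp_random_graph, n, p)
    print("p=%.2f  gnp_random_graph: %6.2f ms  fast_gnp_random_graph: %6.2f ms"
          % (p, 1e3 * t_regular, 1e3 * t_fast))
```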
Algorithmically, `gnp_random_graph` should scale as O(n^2), where n is the number of nodes, and `fast_gnp_random_graph` should scale as O(p n^2), i.e. with the expected number of edges. But it looks like, in practice, adding edges to the graph costs something, so `gnp_random_graph` also has some dependency on the number of edges.
@jg-you I didn't notice the log scale, whoops. I see that it sits in the range [0.1, 0.2], but I am surprised this is consistent across different values of N (apparently spanning orders of magnitude). Before seeing this I would have assumed that the switch point would vary depending on N.
@hagberg I think that `gnp_random_graph` would actually scale with `p` in the function:

```python
...
for e in edges:
    if random.random() < p:
        G.add_edge(*e)
return G
```

I am assuming the call to `G.add_edge(*e)` takes longer than not calling it. Larger `p` values would mean more calls and longer running times.
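That assumption is easy to check in isolation; a quick sketch that times the same Bernoulli loop with and without the `add_edge` call (graph size and density are arbitrary):

```python
import random
import timeit
from itertools import combinations

import networkx as nx

def bernoulli_loop(n, p, add_edges):
    # Identical per-edge coin flips; optionally skip the insertion so that
    # only the cost of G.add_edge differs between the two runs.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for e in combinations(range(n), 2):
        if random.random() < p and add_edges:
            G.add_edge(*e)
    return G

n, p = 400, 0.3
t_with = timeit.timeit(lambda: bernoulli_loop(n, p, True), number=5)
t_without = timeit.timeit(lambda: bernoulli_loop(n, p, False), number=5)
print("with add_edge:    %.3f s" % t_with)
print("without add_edge: %.3f s" % t_without)
```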
Closing the issue. The optimization window is too variable, and too heavily influenced by the environment, to justify unified algorithm selection.
TL;DR

`gnp_random_graph` is faster than `fast_gnp_random_graph` for `p > 0.2`, and vice versa.

Rambling
I've been looking into implementing the stochastic block model (which will come in a later pull request), which eventually led me to the source code of the G_{n,p} network generator and its fast counterpart. After running some tests, it appears that there is potential for optimization.
The regular algorithm iterates over the `N(N-1)/2` potential edges and tests whether each one exists with probability `p`. In principle it is therefore of complexity `O(N^2)`, but it has a strong dependency on `p`, because no memory operation is required when a link tests "negative". The fast algorithm is of complexity `O(n+m)`, with `m` the expected number of links. It basically samples the distribution of gaps between consecutive positive tests, and incorporates some fancy indexing tricks.
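To make the difference concrete, here is a small self-contained sketch of the two sampling strategies over the same list of candidate edges. It is a simplified illustration of the idea only, not the actual networkx implementation, which avoids materialising the edge list and maps the running index to a node pair arithmetically:

```python
import math
import random
from itertools import combinations

import networkx as nx

def sample_systematic(n, p):
    # Regular method: one uniform draw per candidate edge, ~N^2/2 draws.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u, v in combinations(range(n), 2):
        if random.random() < p:
            G.add_edge(u, v)
    return G

def sample_geometric(n, p):
    # Fast method: draw the gap to the next "positive" candidate directly,
    # so the number of draws is proportional to the number of edges (m).
    assert 0 < p < 1
    G = nx.Graph()
    G.add_nodes_from(range(n))
    edges = list(combinations(range(n), 2))  # materialised only for clarity
    log_q = math.log(1.0 - p)
    i = -1
    while True:
        # Geometric(p) gap between consecutive successes of a Bernoulli(p) process.
        gap = 1 + int(math.log(1.0 - random.random()) / log_q)
        i += gap
        if i >= len(edges):
            break
        G.add_edge(*edges[i])
    return G
```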
For any given `p`, the number of memory operations is -- by definition -- the same on average. This allows us to factor out the memory cost of both methods. Choosing the most efficient method ("fast" vs. regular) is then only a matter of testing whether it's faster to sample a list of length `N**2` geometrically or systematically. There is no doubt that the geometric method is faster in extremely sparse cases, but there must be a `p` beyond which the overhead entailed by the geometric sampler outweighs its benefits.

The outcome is bound to be influenced by the environment (e.g. Python version, hardware), but it's possible to come up with a rule of thumb that could be implemented directly in networkx. More precisely, for all `p` under the lower bound, `gnp_random_graph` could call `fast_gnp_random_graph` under the hood, and vice versa. Based on my tests I believe that this value should be `p=0.2`. Because the transition point depends on the environment in non-trivial ways, it should be overridable, but it could still be provided as a default.

Test code
Results
Changing the graph size
The most important of all: comparison of the two methods for various list lengths
There is a small dependency on `N`; as `N` grows, the optimal `p` decreases ever so slightly. Might be noise, might be a real effect. Nonetheless, `p=0.20` seems like a good rule of thumb.

Time dispersion of both methods
The time dispersion is similar for both methods, i.e. running times are concentrated tightly around the average.
Environment's effect
The code runs slightly faster on `python3.4`, but the speed-up is much more pronounced for the regular (systematic) method, which relies on plain iterators.

The hosts affect the timings (obviously), and the transition point :/. All tests are performed on the fastest host (host B).
The `-O` flag doesn't change anything, as expected.