snap-stanford / snap

Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library.

Loading a big graph is slow #148

Open · minhlab opened this issue 6 years ago

minhlab commented 6 years ago

I tried running the random walk example on a graph of 1.1B edges and it took 37 min just to load the graph into memory.

How can we improve this speed, e.g. by using TVecPool somehow?
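For reference, the loading step boils down to something like the sketch below (the file name is a placeholder). As far as I can tell, TSnap::LoadEdgeList parses the file line by line and grows the graph incrementally, without reserving any capacity up front:

```cpp
#include "Snap.h"

int main() {
  // TSnap::LoadEdgeList streams the file and adds nodes/edges as they
  // appear; columns 0 and 1 hold the source and destination node ids.
  PNGraph Graph = TSnap::LoadEdgeList<PNGraph>("graph.txt", 0, 1);
  printf("nodes: %d, edges: %d\n", Graph->GetNodes(), Graph->GetEdges());
  return 0;
}
```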

RemyLau commented 6 years ago

Same here!

I've been using node2vec (both the C++ and Python implementations) and it works just fine for small- to medium-sized graphs (~20k nodes, ~300k edges): about 20 min with Python and 5 min with the C++ implementation. (By the way, I've modified the Python version to parallelize the preprocessing step with minimal overhead, and that brings it down to about 2 min.)

Anyway, I've recently been working with a dense graph (let's call it G), i.e. far from sparse (~28k nodes, 300M edges, edge list ~10 GB on disk), and the loading speed (~10 s per 100k edges) is far slower than for the small graphs mentioned earlier (let's call one of them g): ~1 s per 100k edges. At first I thought the sheer size of G's edge list was what made it slower, so I split it into chunks, each the size of g's edge list. The loading speed did not change at all.

Then I opened the edge lists of both graphs and compared them side by side. The only difference I could see was how frequently the source node, i.e. the first node id, switches. (I should have mentioned this earlier, but the edge lists are numerically sorted, so all edges with source node id=1 sit at the top of the list, followed by source node id=10, and so on.) In the sparse graph the source node switches far more often than in the dense one. To test that hypothesis, I shuffled the order of edges in G's edge list. The initial loading speed for G improved significantly (~1 s per 100k edges), but once ~5M edges had been loaded, the speed dropped back to 10 s per 100k edges.

Any thoughts on how to improve this? I'm guessing it has something to do with memory allocation, though I haven't found where the allocation happens in the C++ implementation yet. One strategy I can think of is to increase the amount allocated each time the structure runs out of space, so that less time is spent reallocating (I'm not sure what growth factor the C++ code uses; for std::vector it's typically 1.5 to 2). I'd love to hear suggestions, because I really need this embedding to work, and it needs to be fast.
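To make the reallocation cost concrete, here's a toy sketch in plain C++ (not SNAP code; the sizes are made up). Most std::vector implementations grow by a factor of 1.5 to 2 per reallocation, and each reallocation copies everything accumulated so far, which is what makes incremental growth expensive at hundreds of millions of edges:

```cpp
#include <chrono>
#include <cstdio>
#include <utility>
#include <vector>

// Fill a vector with n fake (src, dst) edge pairs, optionally reserving the
// full capacity first, and return the elapsed wall time in seconds.
static double FillSeconds(std::size_t n, bool reserveFirst) {
  const auto t0 = std::chrono::steady_clock::now();
  std::vector<std::pair<int, int>> edges;
  if (reserveFirst) { edges.reserve(n); }  // single allocation up front
  for (std::size_t i = 0; i < n; ++i) {
    edges.emplace_back(static_cast<int>(i), static_cast<int>(i + 1));
  }
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const std::size_t n = 100000000;  // 100M fake edges (~800 MB of pairs)
  std::printf("grow-as-you-go: %.2f s\n", FillSeconds(n, false));
  std::printf("reserve first:  %.2f s\n", FillSeconds(n, true));
  return 0;
}
```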

I've also tried a loading tool written in Python that builds a list of dictionaries, and it's the fastest graph loader I've used: ~10-15 min to load the full edge list. The reason I don't use the Python implementation overall is that it somehow incurs more overhead than the C++ one, and I haven't figured out why.

minhlab commented 5 years ago

Yeah, I'm thinking along the lines of memory allocation too. If you know the number of nodes and edges beforehand, you might benefit from allocating everything once at the beginning, but I haven't figured out how to do that yet.
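A minimal sketch of that idea, assuming SNAP's TNGraph::New(Nodes, Edges) overload (which, as I read it, sizes the node table once up front; per-node adjacency vectors still grow as edges arrive) and TSsParser for streaming the file. The file name, column layout, and expected counts below are placeholders, not anything from this thread:

```cpp
#include "Snap.h"

// Hypothetical pre-allocating loader: pass expected node/edge counts so the
// internal tables are sized once instead of being grown and rehashed while
// the edge list streams in.
PNGraph LoadEdgeListReserved(const TStr& FNm, const int ExpectNodes, const int ExpectEdges) {
  PNGraph Graph = TNGraph::New(ExpectNodes, ExpectEdges);
  TSsParser Ss(FNm, ssfWhiteSep);  // whitespace-separated "src dst" lines
  int Src, Dst;
  while (Ss.Next()) {
    if (!Ss.GetInt(0, Src) || !Ss.GetInt(1, Dst)) { continue; }
    if (!Graph->IsNode(Src)) { Graph->AddNode(Src); }
    if (!Graph->IsNode(Dst)) { Graph->AddNode(Dst); }
    Graph->AddEdge(Src, Dst);
  }
  return Graph;
}

int main() {
  // Counts for the dense graph G discussed above (~28k nodes, 300M edges).
  PNGraph G = LoadEdgeListReserved("G_edgelist.txt", 28000, 300000000);
  printf("nodes: %d, edges: %d\n", G->GetNodes(), G->GetEdges());
  return 0;
}
```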