EntilZha opened this issue 7 years ago
The node2vec algorithm is quite memory-intensive.
As you mentioned, learning the embeddings is the slow part of the algorithm. The running time can be reduced (at the cost of output quality) by decreasing the context size.
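For example, the SNAP node2vec example exposes the context size through its `-k:` flag (default 10). Something like the following (reusing the file names from your command) would halve the context:

```
node2vec -i:edge_list.txt -o:wiki.emb -k:5 -v -dr
```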
Furthermore, you can print the generated walks and train the embeddings separately.
In the file snap/examples/node2vec/node2vec.cpp, comment out line 100 (the WriteOutput call).
In the file snap/snap-adv/n2v.cpp, comment out line 38 (the LearnEmbeddings call) and print the matrix WalksVV instead. Inserting the following code in place of LearnEmbeddings() should do the trick:
```cpp
for (int64 i = 0; i < WalksVV.GetXDim(); i++) {    // one row per walk
  for (int64 j = 0; j < WalksVV.GetYDim(); j++) {  // one node per step
    printf("%d ", (int)WalksVV(i, j));             // cast TInt -> int for varargs
  }
  printf("\n");
}
```
Node2vec will then skip the word2vec training and output the generated walks to standard output (or to a file, if you replace printf with fprintf). You can use this to train the SGD step separately, using TensorFlow, the original word2vec (with the Continuous Bag of Words model instead of Skip-Gram), or anything else you prefer.
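A minimal sketch of the fprintf variant, dropped in at the same spot in n2v.cpp (the file name walks.txt is just an example):

```cpp
// Write the generated walks to a file instead of stdout.
FILE* F = fopen("walks.txt", "w");
for (int64 i = 0; i < WalksVV.GetXDim(); i++) {
  for (int64 j = 0; j < WalksVV.GetYDim(); j++) {
    fprintf(F, "%d ", (int)WalksVV(i, j));
  }
  fprintf(F, "\n");
}
fclose(F);
```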
I am not aware of any pre-trained embeddings for the Wikipedia dataset.
When it comes to memory consumption, it is not the word2vec part that is problematic; it is the preprocessing for the random walks (the PreprocessTransitionProbs function). I designed an improvement to the node2vec algorithm that consumes much less memory, but its implementation is not yet final. It is available on my GitHub: https://github.com/vid-koci/snap/tree/master/examples/veles, but I cannot yet guarantee 100% effectiveness (this is why it is not yet merged into SNAP). Also keep in mind that the word2vec part of this program is the same as in node2vec.
I hope this helps.
Thanks for the info on how to separate out the different steps.
I was able to reduce the parameter sizes and got the model to run to completion; unfortunately, all the embeddings were -nan (or at least all of the ones I looked at). It's hard to know what caused it, but I imagine the large graph size might have been part of the issue.
I think I am going to try reducing the size of the graph, even if that risks destroying some of its useful structure.
Could you please send the exact parameters and data you ran the program with? The size of the graph should not cause the embeddings to be -nans.
@vid-koci What is the difference between the node2vec in your GitHub and the node2vec in SNAP? I can't see any difference.
The node2vec in my GitHub should be the same. On my GitHub there also exists a project called Veles, a heuristic approach to node2vec that uses much less memory. Unfortunately, the implementation seems to produce embeddings of slightly (2-3%) lower quality than it should, and I can't figure out why. This is why I haven't merged it into SNAP.
Sorry, I can't access the Veles project with this link: https://github.com/vid-koci/snap/tree/master/examples/veles. Your GitHub only shows a public fork of SNAP; please update it.
It is placed in a separate branch https://github.com/vid-koci/snap/tree/veles/examples/veles. Use at your own risk.
Thank you so much. What do you think about clustering the whole graph into 2 or more smaller graphs, to fit in memory? Or should I try https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark instead?
It depends on the size of the graph. I don't think the Spark implementation uses any less memory than the SNAP implementation (but you should still give it a try). Veles uses about 6x-40x less memory (especially on dense graphs). If that still uses too much RAM, you can try the DeepWalk algorithm (not as good as node2vec, but it uses less RAM) or cluster the graph, as you proposed, and then select the embeddings that give the best results on test data.
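For intuition on why DeepWalk needs less RAM: its walks are first-order, so nothing has to be precomputed per edge; a walker only ever needs the adjacency list. A minimal sketch of such a walk (my own illustration, not DeepWalk's actual code):

```cpp
#include <random>
#include <vector>

// One uniform (first-order) random walk: the next hop depends only on
// the current node, so no 2nd-order transition tables are required.
std::vector<int> UniformWalk(const std::vector<std::vector<int>>& Adj,
                             int StartNode, int WalkLen, std::mt19937& Rng) {
  std::vector<int> Walk = {StartNode};
  while ((int)Walk.size() < WalkLen && !Adj[Walk.back()].empty()) {
    const std::vector<int>& Nbrs = Adj[Walk.back()];
    std::uniform_int_distribution<int> Pick(0, (int)Nbrs.size() - 1);
    Walk.push_back(Nbrs[Pick(Rng)]);  // O(1) extra memory per step
  }
  return Walk;
}
```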
@vid-koci What heuristic are you using in Veles that uses so much less memory? Could I avoid your reported performance loss if I outputted the random walks instead and ran word2vec on them separately?
A different approach to random walks. Node2vec uses a lot of memory because it uses 2nd-order Markovian random walks and pre-computes all the transition tables in advance. This means one table entry for each walk of length two, and the number of such walks grows quadratically in dense graphs.
In Veles, the transition tables are approximated with weighted trees, and neighbourhoods are approximated with distances to a small number of randomly selected nodes. To answer your second question: the performance loss is probably due to some mistake in the heuristic implementation, so no. The word2vec implementation is the same as for node2vec.
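To make the quadratic growth concrete: one table entry per length-two walk means that, for every directed edge (u, v), there is one entry per neighbor of v. A rough back-of-the-envelope sketch (my own illustration, not code from SNAP or Veles):

```cpp
#include <cstdint>
#include <vector>

// Count the transition-table entries node2vec would precompute:
// one entry for each length-two walk u -> v -> w.
int64_t EstimateTableEntries(const std::vector<std::vector<int>>& Adj) {
  int64_t Entries = 0;
  for (const std::vector<int>& Nbrs : Adj) {  // each source node u
    for (int V : Nbrs) {                      // each edge (u, v)
      Entries += (int64_t)Adj[V].size();      // one entry per neighbor w of v
    }
  }
  return Entries;  // grows roughly quadratically with graph density
}
```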
Hello @vid-koci, I'm also interested in memory-efficient node2vec. Veles looks very interesting! Do you have more detailed documents/papers about it? I read your code, but I'm still not sure how the heuristic works... Thank you.
Dear @KIwabuchi, there is one publication on it; however, it is not in English. You can find it here: http://eprints.fri.uni-lj.si/4019/. I am sorry for the inconvenience; hopefully automatic translators can help you. Chapter 3 contains a description of the heuristic approach.
@vid-koci Thank you so much for the publication. I'll take a look at it.
I am working on training node2vec embeddings on Wikipedia using `snap/examples/node2vec`. I want to understand what is causing the high memory usage and slow runtime, to see if there is something I can do to improve performance.

System: AWS EC2 x1.32xlarge instance, 2TB RAM, 128 cores

Dataset:
s3://entilzha-us-west-2/wiki-network/titles-sorted.txt
s3://entilzha-us-west-2/wiki-network/links-simple-sorted.txt

Preprocessing: `preprocess_titles` and `n2v_edge_list`

Command: `node2vec -i:edge_list.txt -o:wiki.emb -v -dr`

Final dataset info (I can post this input if it's helpful):

When this is run it uses 750GB of RAM. It has gotten to learning the word2vec embeddings, but that step seems very slow, at about 1% per hour despite ~100% utilization on all 128 cores.
My general questions are:
For the embeddings, `emb_dim * number_of_nodes ~ 500,000,000` entries, and assuming that each entry is a double that should be `500,000,000 * 8 bytes = 4GB`. For walks it should be something like `n_walks * walk_size * number_of_nodes = 10 * 80 * 4,000,000 = 3,200,000,000` entries, and again assuming each entry is a double that would lead to about 25GB. Memory usage seems to go up here: https://github.com/snap-stanford/snap/blob/master/snap-adv/n2v.cpp#L14. Any thoughts on why it's using 30x that amount of memory? From what I can tell, things that should affect these would be:
Thanks!