snap-stanford / snap

Stanford Network Analysis Platform (SNAP) is a general purpose network analysis and graph mining library.

node2vec on Wikipedia #99

Open · EntilZha opened this issue 7 years ago

EntilZha commented 7 years ago

I am working on training node2vec embeddings on Wikipedia using snap/examples/node2vec. I want to understand what is causing the high memory usage and slow runtime, and whether there is something I can do to improve performance.

System: AWS EC2 x1.32xlarge instance, 2TB RAM, 128 cores

Dataset: final dataset info (I can post this input if it's helpful):

When this is run it uses 750GB of RAM. It has gotten to learning the word2vec embeddings, but that step is very slow, at about 1% per hour, despite ~100% utilization on all 128 cores.

My general questions are:

  1. Is there any guidance for training node2vec on Wikipedia, or more generally on a graph this large? The paper references one wiki dataset, but that one seems to be NLP-centric: it is the first million bytes of Wikipedia text.
  2. On a related note, are there by chance pre-trained embeddings available from some other project?
  3. Looking at the node2vec algorithm in section 3.2.3, the observed memory usage seems quite high. The main consumers of memory appear to be the data structure holding the walks and the data structure holding the embeddings. The embeddings should be `emb_dim * number_of_nodes ≈ 500,000,000` entries, and assuming each entry is a double that is `500,000,000 * 8 bytes = 4GB`. The walks should be about `n_walks * walk_size * number_of_nodes = 10 * 80 * 4,000,000 = 3,200,000,000` entries, and again assuming each entry is a double that comes to about 25GB (a quick arithmetic check is sketched after this list). Memory usage seems to go up here https://github.com/snap-stanford/snap/blob/master/snap-adv/n2v.cpp#L14. Any thoughts on why it's using 30x that amount of memory?
  4. Is there a way that I could get just the random walk output and train the SGD step separately (on a GPU)?
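For concreteness, here is a quick arithmetic check of the estimates in point 3. The figures are assumptions taken from the numbers above (~4,000,000 nodes, 128 dimensions, and the default 10 walks of length 80 per node):

```cpp
// Back-of-envelope memory estimate; all figures are assumptions
// based on the numbers quoted above, at 8 bytes (one double) per entry.
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t Nodes = 4000000;     // approximate Wikipedia node count
  const int64_t Dim = 128;           // embedding dimensionality
  const int64_t WalksPerNode = 10;   // default n_walks
  const int64_t WalkLen = 80;        // default walk_size

  const int64_t EmbEntries = Nodes * Dim;                      // ~5.1e8
  const int64_t WalkEntries = Nodes * WalksPerNode * WalkLen;  // ~3.2e9

  printf("embeddings: %.1f GB\n", EmbEntries * 8 / 1e9);   // ~4.1 GB
  printf("walks:      %.1f GB\n", WalkEntries * 8 / 1e9);  // ~25.6 GB
  return 0;
}
```

Together that is roughly 30GB, which is far below the 750GB I observe.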

From what I can tell things that should affect these would be:

  1. A smaller context size should make the SGD step faster.
  2. Fewer and shorter walks per source would use less memory, but I don't know how small these should be for something like Wikipedia (or whether the defaults themselves are already too low); an example invocation with reduced parameters is sketched after this list.
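As an illustration, an invocation with reduced parameters might look like the following (the flag names are from the node2vec example's README; the paths and the reduced values are just placeholders I would experiment with):

```
./node2vec -i:wiki.edgelist -o:wiki.emb -d:64 -r:5 -l:40 -k:5 -v
```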

Thanks!

vid-koci commented 7 years ago

The node2vec algorithm is quite memory-intensive. As you mentioned, learning the embeddings is the slow part of the algorithm. The running time can be reduced (at the cost of output quality) by decreasing the context size.

Furthermore, you can print the generated walks and train the embeddings separately. In the file snap/examples/node2vec/node2vec.cpp, comment out line 100 (WriteOutput). In the file snap/snap-adv/n2v.cpp, comment out line 38 (LearnEmbeddings) and print the matrix WalksVV instead. Inserting the following code in place of LearnEmbeddings() should do the trick:

```cpp
for (int64 i = 0; i < WalksVV.GetXDim(); i++) {
  for (int64 j = 0; j < WalksVV.GetYDim(); j++) {
    printf("%d ", (int)WalksVV(i, j));  // node ID at step j of walk i
  }
  printf("\n");
}
```

Node2vec will then skip the word2vec training and output the generated walks to standard output (or to a file, if you replace printf with fprintf). You can use this to train the SGD step separately, using TensorFlow, the original word2vec (with the Bag of Words model instead of Skip-Gram), or anything else you prefer.
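For the file variant mentioned above, the same loop with fprintf might look like this ("walks.txt" is just a placeholder path):

```cpp
// Same loop, writing the walks to a file instead of stdout.
// "walks.txt" is a placeholder output path.
FILE* F = fopen("walks.txt", "w");
for (int64 i = 0; i < WalksVV.GetXDim(); i++) {
  for (int64 j = 0; j < WalksVV.GetYDim(); j++) {
    fprintf(F, "%d ", (int)WalksVV(i, j));
  }
  fprintf(F, "\n");
}
fclose(F);
```

Each line of the file is then one walk, i.e. a sentence of node IDs that word2vec-style tools can consume directly.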

I am not aware of any pre-trained embeddings for the Wikipedia dataset.

When it comes to memory consumption, it's not the word2vec part that is problematic; it is the preprocessing for the random walks (the PreprocessTransitionProbs function). I designed an improvement to the node2vec algorithm that consumes much less memory, but its implementation is not yet final. It is available on my GitHub at https://github.com/vid-koci/snap/tree/master/examples/veles, but I cannot yet guarantee it works 100% correctly (this is why it's not yet merged into Snap). Also keep in mind that the word2vec part of that program is the same as in node2vec.

I hope this helps.

EntilZha commented 7 years ago

Thanks for the info on how to separate out the different steps.

I was able to reduce the parameter sizes and got the model to run to completion; unfortunately, all the embeddings were -nans (or at least all of the ones I looked at). It's hard to know what caused it, but I imagine the large graph size might have been part of the issue.

I think I am going to try reducing the size of the graph, even if that risks destroying some of the useful structure in it.

vid-koci commented 7 years ago

Could you please send the exact parameters and data you ran the program with? The size of the graph should not cause the embeddings to be -nans.

doannam020293 commented 6 years ago

@vid-koci What is the difference between the node2vec in your GitHub and the node2vec in snap? I can't see any difference.

vid-koci commented 6 years ago

The node2vec in my GitHub should be the same. My GitHub also hosts a project called Veles, a heuristic approach to node2vec that uses much less memory. Unfortunately, the implementation seems to produce embeddings of slightly (2-3%) lower quality than it should, and I can't figure out why. This is why I haven't merged it into snap.

doannam020293 commented 6 years ago

Sorry, I can't access the Veles project through this link: https://github.com/vid-koci/snap/tree/master/examples/veles. Your GitHub only shows a public fork of snap; please update it.

vid-koci commented 6 years ago

It is placed in a separate branch https://github.com/vid-koci/snap/tree/veles/examples/veles. Use at your own risk.

doannam020293 commented 6 years ago

Thank you so much. What do you think about clustering the whole graph into 2 or more smaller graphs to fit in memory? Or should I try https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark instead?

vid-koci commented 6 years ago

It depends on the size of the graph. I don't think the Spark implementation uses any less memory than the snap implementation (but you should still give it a try). Veles uses about 6x-40x less memory (especially on dense graphs). If that still uses too much RAM, you can try the DeepWalk algorithm (not as good as node2vec, but it uses less RAM) or cluster the graph, as you proposed, and then select the embeddings that give the best results on your test data.

sabetAI commented 5 years ago

@vid-koci what heuristic are you using in Veles that uses so much less memory? Could I avoid your reported performance loss if I outputted the random walks instead and ran word2vec on them separately?

vid-koci commented 5 years ago

A different approach to random walks. Node2vec uses a lot of memory because it uses 2nd-order Markovian random walks and pre-computes all the transition tables in advance. This means one table entry for each walk of length 2, which grows quadratically in dense graphs.
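To see why it grows quadratically: for each directed edge (t, v), the preprocessing stores one transition entry per neighbour of v, so the total number of entries is roughly the sum of squared node degrees. A minimal sketch of that count using the SNAP API ("graph.txt" is a placeholder edge-list path):

```cpp
// Rough count of the precomputed 2nd-order transition entries:
// one entry per (edge into v, neighbour of v) pair, i.e. about
// the sum over v of deg(v)^2 for an undirected graph.
// "graph.txt" is a placeholder edge-list path.
#include "Snap.h"

int main() {
  PUNGraph Graph = TSnap::LoadEdgeList<PUNGraph>("graph.txt", 0, 1);
  long long Entries = 0;
  for (TUNGraph::TNodeI NI = Graph->BegNI(); NI < Graph->EndNI(); NI++) {
    const long long Deg = NI.GetDeg();
    Entries += Deg * Deg;
  }
  printf("approx. transition-table entries: %lld\n", Entries);
  return 0;
}
```

A few high-degree hubs, which Wikipedia has plenty of, dominate this sum, which is why dense or hub-heavy graphs blow up.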

In Veles, the transition tables are approximated with weighted trees, and neighbourhoods are approximated with distances to a small number of randomly selected nodes. To answer your second question: the performance loss is probably due to some mistake in the heuristic's implementation, so no. The word2vec implementation is the same as in node2vec.

KIwabuchi commented 5 years ago

Hello @vid-koci, I'm also interested in memory-efficient node2vec. Veles looks very interesting! Do you have more detailed documents/papers about it? I read your code, but I'm still not sure how the heuristic works... Thank you.

vid-koci commented 5 years ago

Dear @KIwabuchi, there is one publication on it; however, it is not in English. You can find it here: http://eprints.fri.uni-lj.si/4019/ I am sorry for the inconvenience; hopefully automatic translators can help you. Chapter 3 contains a description of the heuristic approach.

KIwabuchi commented 5 years ago

@vid-koci Thank you so much for the publication. I'll take a look at it.