uhh-lt / path2vec

Learning to represent shortest paths and other graph-based measures of node similarities with graph embeddings
Apache License 2.0

node2vec baseline #5

Closed alexanderpanchenko closed 6 years ago

alexanderpanchenko commented 6 years ago
  1. Download the WordNet graph here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/

  2. Convert the adjlist file to edgelist file (which the node2vec takes as an input) using networkX
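
A minimal sketch of this conversion with networkX (the function and file names are illustrative, not from the repo):

```python
import networkx as nx

def adjlist_to_edgelist(adjlist_path, edgelist_path):
    """Convert an adjacency-list graph file to the edge-list format
    that the node2vec reference implementation expects.

    Caveat: nodes with no edges do not appear in an edge list, so
    they get no embedding (this comes up later in this thread).
    """
    g = nx.read_adjlist(adjlist_path)
    nx.write_edgelist(g, edgelist_path, data=False)
    return g.number_of_nodes(), g.number_of_edges()
```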

  3. Train the graph embeddings using node2vec (https://github.com/snap-stanford/snap/tree/master/examples/node2vec) for all combinations of the following parameters:

Number of dimensions: 50, 100, 200, 300
Number of walks per source: 5, 10, 25
Context size for optimization: 5, 10, 25
Number of epochs in SGD: 1, 5, 10
Graph is directed: true, false

Overall, you need to output 4 × 3 × 3 × 3 × 2 = 216 models (the name of each embedding file should contain its hyperparameters).
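
The grid above can be enumerated in a few lines; a sketch that also generates file names following the pattern used later in this thread (e.g. wordnet.100.5.5.1.T.emb):

```python
from itertools import product

dims = [50, 100, 200, 300]
walks = [5, 10, 25]       # walks per source
contexts = [5, 10, 25]    # context size for optimization
epochs = [1, 5, 10]       # SGD epochs
directed = [True, False]

# One model file per combination; the name encodes the hyperparameters.
names = ["wordnet.{}.{}.{}.{}.{}.emb".format(d, w, c, e, "T" if dr else "F")
         for d, w, c, e, dr in product(dims, walks, contexts, epochs, directed)]
print(len(names))  # 4 * 3 * 3 * 3 * 2 = 216
```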

  4. Look up the names of the synsets for each file from the pickle file (http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/nodes.pkl).

You can do it using this script: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/convert_embedding.py

OR manually, like this:


import pickle

# nodes.pkl maps node indices (as used in the graph and embedding files)
# to synset names
with open('nodes.pkl', 'rb') as f:
    pkl = pickle.load(f)
synset_name = pkl.get(node_index)  # e.g. 'koran.n.01'
  5. Perform the evaluation of each model using the evaluation script. See details here: https://github.com/uhh-lt/shortpath2vec.

  6. Save the results in this table in the 'node2vec' sheet (similarly as for the deepwalk method): https://docs.google.com/spreadsheets/d/1KjNns16ld3pVUY1K7aA0HH9Lrb1PiKBH7-ZIbWrD9Kc/edit#gid=1803816318

snkohail commented 6 years ago

I have a problem during evaluation:

kohail@ltgpu1:~/shortpath2vec$ python3 evaluation.py converted_emb/converted_wordnet.100.5.5.1.T.emb simlex/simlex_synsets/max_jcn_brown_human.tsv
[nltk_data] Downloading package wordnet to /home/kohail/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
2018-05-18 16:04:55,305 : INFO : loading projection weights from converted_emb/converted_wordnet.100.5.5.1.T.emb
2018-05-18 16:05:01,960 : INFO : loaded (74401, 100) matrix from converted_emb/converted_wordnet.100.5.5.1.T.emb
2018-05-18 16:05:01,960 : INFO : precomputing L2-norms of word weight vectors
2018-05-18 16:05:02,625 : INFO : Pearson correlation coefficient against simlex/simlex_synsets/max_jcn_brown_human.tsv: 0.0276
2018-05-18 16:05:02,625 : INFO : Spearman rank-order correlation coefficient against simlex/simlex_synsets/max_jcn_brown_human.tsv: -0.0004
2018-05-18 16:05:02,626 : INFO : Pairs with unknown words ratio: 1.2%
Traceback (most recent call last):
  File "evaluation.py", line 19, in <module>
    dynamic_synset_score = evaluate_synsets(model, 'simlex/simlex_original.tsv', logger, dummy4unknown=True)
  File "/home/kohail/shortpath2vec/evaluate_lemmas.py", line 53, in evaluate_synsets
    possible_similarity = emb_model.similarity(pair[0].name(), pair[1].name())
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 828, in similarity
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 169, in __getitem__
    return self.get_vector(entities)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 277, in get_vector
    return self.word_vec(word)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 274, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'koran.n.01' not in vocabulary"

alexanderpanchenko commented 6 years ago

do you have an embedding for this synset (koran.n.01)?


snkohail commented 6 years ago

No, I checked; it's not there.

alexteua commented 6 years ago

@snkohail I had the same problem: when I exported the graph to an edge-list file, it simply didn't include the vertices with no edges (koran.n.01 has no hypernyms or hyponyms, for example).

alexteua commented 6 years ago

You can use this adjlist file; it contains all of them: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/wordnet.adjlist

snkohail commented 6 years ago

Oh thanks.. I'll convert it and run node2vec again ..

snkohail commented 6 years ago

@alexteua I still have the same problem. When I convert the adjlist to an edgelist and apply node2vec, no embedding for the synset "koran" is produced...

snkohail commented 6 years ago

here is a sample of the produced embeddings by node2vec: https://drive.google.com/open?id=1E9MwpHMKcQK7nGE22fvw9kK9C1qqIQlk

alexteua commented 6 years ago

@snkohail I guess unconnected vertices are lost during the conversion from adjlist to edgelist (did you check the edgelist you got for the "koran" synset?). Also, can the node2vec implementation you are using consume the adjlist format? Can you send me a link to that implementation?

alexteua commented 6 years ago

@akutuzov do you know how to preserve all vertices in the edge-list format? Could making a connection from the synset to itself be a solution? Like ..... koran.n.01 koran.n.01 .....

alexanderpanchenko commented 6 years ago

Yes, I think that self-connections are a good choice here.


akutuzov commented 6 years ago

@snkohail @alexteua Yes, I suppose that nodes with no edges are simply lost during conversion to edgelist. As far as I can see, node2vec accepts only edgelist as an input. Then you indeed can just add self connections to such nodes and then make sure that they do appear in the final edgelist.
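
A sketch of that fix with networkX (the helper name is mine): add a self-loop to every isolated node before writing the edge list, so node2vec still produces a vector for it.

```python
import networkx as nx

def write_edgelist_with_selfloops(adjlist_path, edgelist_path):
    """Write an edge list in which isolated nodes are preserved by
    giving each of them a self-loop, e.g. "koran.n.01 koran.n.01"."""
    g = nx.read_adjlist(adjlist_path)
    isolated = list(nx.isolates(g))
    g.add_edges_from((n, n) for n in isolated)
    nx.write_edgelist(g, edgelist_path, data=False)
    return isolated  # for a sanity check on how many nodes were patched
```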

akutuzov commented 6 years ago

@snkohail Have you managed to include all 82 116 noun synsets in your edge list?

snkohail commented 6 years ago

@akutuzov, I noticed that there were 7542 nodes missing from the embeddings list (those with degree 0), so I added a self-connection for each of them. I also compared the final number of unique nodes in the adjlist and the edgelist (before and after conversion) to make sure they match. Now I am running node2vec on the newly converted edgelist. Thanks all for your comments.
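
That node-count check can be sketched like this (the helper name is mine):

```python
import networkx as nx

def nodes_preserved(adjlist_path, edgelist_path):
    """Sanity check after conversion: the adjlist and the edgelist
    should contain exactly the same set of unique nodes."""
    a = set(nx.read_adjlist(adjlist_path).nodes())
    e = set(nx.read_edgelist(edgelist_path).nodes())
    return a == e
```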

akutuzov commented 6 years ago

Great, looking forward to seeing the resulting scores!

akutuzov commented 6 years ago

Hi @snkohail Any news about the node2vec embeddings?

snkohail commented 6 years ago

@akutuzov results according to max_jcn_brown_human are in the sheet

akutuzov commented 6 years ago

Thanks @snkohail. Why is there NaN for some models?

snkohail commented 6 years ago

There was a segmentation fault when running some models; I have already fixed it. I will upload the node2vec embeddings (nodes converted to original names) to my drive and post a link. Meanwhile, I am running the evaluation again.

akutuzov commented 6 years ago

Yes, can you please upload somewhere the models with 25 walks, context size 10 and 10 epochs for the following dimensions: 50, 100, 200, 300?

If you also can train models with dimensionality 600, that would be great, but first we need the models I mentioned above. Thanks!

snkohail commented 6 years ago

requested models: https://drive.google.com/open?id=1V8V5_yiQrleugQQo6grr2qVhj5nfKXmA

akutuzov commented 6 years ago

It seems that using the directed graph is useless, right? How did you define the direction of the edges, by the way?

snkohail commented 6 years ago

Yes, I noticed that as well. The direction is a parameter that you can enable/disable in node2vec. I think it takes the same order of edges as in wordnet.edgelist; I just used the file that resulted from converting the adjlist.

alexteua commented 6 years ago

@akutuzov same for DeepWalk, btw

snkohail commented 6 years ago

For some experiments (see the yellow records in the sheet), I get this error:

2018-05-21 23:32:00,787 : INFO : Pairs with unknown words ratio: 0.0%
2018-05-21 23:32:04,702 : INFO : loading projection weights from converted_emb/converted_wordnet.300.25.10.5.T.emb
2018-05-21 23:32:25,762 : INFO : loaded (82115, 300) matrix from converted_emb/converted_wordnet.300.25.10.5.T.emb
2018-05-21 23:32:25,762 : INFO : precomputing L2-norms of word weight vectors
/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py:1045: RuntimeWarning: overflow encountered in square
  self.vectors[i, :] /= sqrt((self.vectors[i, :] ** 2).sum(-1))
/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py:1045: RuntimeWarning: invalid value encountered in true_divide
  self.vectors[i, :] /= sqrt((self.vectors[i, :] ** 2).sum(-1))
/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:32: RuntimeWarning: overflow encountered in reduce
  return umr_sum(a, axis, dtype, out, keepdims)
Traceback (most recent call last):
  File "evaluation.py", line 18, in <module>
    static_synset_score = model.evaluate_word_pairs(simfile, dummy4unknown=True)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 1014, in evaluate_word_pairs
    spearman = stats.spearmanr(similarity_gold, similarity_model)
  File "/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py", line 3301, in spearmanr
    rho, pval = mstats_basic.spearmanr(a, b, axis)
  File "/usr/local/lib/python3.5/dist-packages/scipy/stats/mstats_basic.py", line 459, in spearmanr
    raise ValueError("The input must have at least 3 entries!")
ValueError: The input must have at least 3 entries!

akutuzov commented 6 years ago

I saw it as well, and it happens only with the models trained on directed graphs. Probably some vectors were left uninitialized, or something like this. As these models are inferior anyway, I think we can just ignore this.
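
If we wanted to confirm that, one way to locate the offending rows, assuming the matrix has been loaded into numpy (e.g. gensim's model.vectors); the helper name is mine:

```python
import numpy as np

def bad_vector_rows(vectors):
    """Return indices of embedding rows that would break L2-normalisation:
    rows with non-finite entries, or whose squared sum overflows float32
    (the RuntimeWarnings in the log above)."""
    v = np.asarray(vectors, dtype=np.float32)
    # suppress the same overflow/invalid warnings gensim hits
    with np.errstate(over="ignore", invalid="ignore"):
        sq = np.square(v).sum(axis=-1)
    return np.where(~np.isfinite(sq))[0]
```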

akutuzov commented 6 years ago

@snkohail Can you please have a look at the paper draft and add a very brief description of how node2vec models were trained? This should be in the subsection 5.2, right after the paragraph about Deepwalk authored by @alexteua. Follow more or less the same description format, make sure to include all non-obvious decisions that you had to make during the training. This can be done tomorrow, but preferably in the first half of the day.

akutuzov commented 6 years ago

@snkohail and can you also test window size 70, as in the Deepwalk experiments?

snkohail commented 6 years ago

Yes. I am running the experiments now with 600 dimensions and I also added the 70. I'll update you once it's done.

snkohail commented 6 years ago

I added some results for 600-dimensional embeddings and others with context size 70. These are the files I have so far. Running node2vec takes forever with higher dimensions and larger context sizes. If you want a specific model, please let me know.

akutuzov commented 6 years ago

Thanks! Yes, I need the context 70 models to download (with vector sizes 100, 200, 300 and 600), if it is possible.

akutuzov commented 6 years ago

@snkohail do we have the models?

snkohail commented 6 years ago

@akutuzov do you have access to our GPU? I can't transfer even one file via scp; it takes forever and the VPN keeps disconnecting.

akutuzov commented 6 years ago

@snkohail as far as I understand, http://ltdata1.informatik.uni-hamburg.de/shortest_path/ is available from outside, you can put the models there.

snkohail commented 6 years ago

@akutuzov , I moved some models. Sorry, 600 with 5 epochs is still running..

akutuzov commented 6 years ago

Thanks! For the future, it's better to upload gzipped versions of the models.
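
For example (demo.emb here is a dummy stand-in for a real model file):

```shell
# Text-format embedding files compress well; gzip them before uploading.
printf 'koran.n.01 0.1 0.2 0.3\n' > demo.emb
gzip -kf demo.emb        # -k keeps the original; produces demo.emb.gz
ls demo.emb demo.emb.gz
```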

akutuzov commented 6 years ago

It seems that the window 70 models do not change the overall picture: they are somewhat better at lower dimensionalities but much worse at higher ones. So I think we will stick to the current hyperparameters in the paper, but still keep these new models and maybe use them in the camera-ready version (if accepted).

snkohail commented 6 years ago

I moved the 600.10.70.5.F model.