I have a problem during evaluation:
kohail@ltgpu1:~/shortpath2vec$ python3 evaluation.py converted_emb/converted_wordnet.100.5.5.1.T.emb simlex/simlex_synsets/max_jcn_brown_human.tsv
[nltk_data] Downloading package wordnet to /home/kohail/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
2018-05-18 16:04:55,305 : INFO : loading projection weights from converted_emb/converted_wordnet.100.5.5.1.T.emb
2018-05-18 16:05:01,960 : INFO : loaded (74401, 100) matrix from converted_emb/converted_wordnet.100.5.5.1.T.emb
2018-05-18 16:05:01,960 : INFO : precomputing L2-norms of word weight vectors
2018-05-18 16:05:02,625 : INFO : Pearson correlation coefficient against simlex/simlex_synsets/max_jcn_brown_human.tsv: 0.0276
2018-05-18 16:05:02,625 : INFO : Spearman rank-order correlation coefficient against simlex/simlex_synsets/max_jcn_brown_human.tsv: -0.0004
2018-05-18 16:05:02,626 : INFO : Pairs with unknown words ratio: 1.2%
Traceback (most recent call last):
  File "evaluation.py", line 19, in <module>
    dynamic_synset_score = evaluate_synsets(model, 'simlex/simlex_original.tsv', logger, dummy4unknown=True)
  File "/home/kohail/shortpath2vec/evaluate_lemmas.py", line 53, in evaluate_synsets
    possible_similarity = emb_model.similarity(pair[0].name(), pair[1].name())
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 828, in similarity
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 169, in __getitem__
    return self.get_vector(entities)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 277, in get_vector
    return self.word_vec(word)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py", line 274, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'koran.n.01' not in vocabulary"
do you have an embedding for this synset (koran.n.01)?
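For reference, a minimal sketch of how to check this directly (assuming the gensim 3.x API visible in the traceback above; the file path is the one from the command in this issue):

```python
# Check whether a synset identifier made it into the trained vocabulary.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    'converted_emb/converted_wordnet.100.5.5.1.T.emb')
print('koran.n.01' in model.vocab)  # False reproduces the KeyError above
```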
No, I checked; it's not there.
@snkohail I had the same problem: when I exported the graph to an edge list file, vertices with no edges were simply not included (koran.n.01 has no hypernyms or hyponyms, for example).
You can use this adjlist file instead; it contains all of them: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/wordnet.adjlist
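A hedged sketch for verifying that the adjlist really covers every WordNet noun synset (assumes networkx and NLTK's WordNet corpus are available, as in the evaluation log above):

```python
# Compare the adjlist's node set against all noun synsets in WordNet.
import networkx as nx
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

G = nx.read_adjlist('wordnet.adjlist')
noun_synsets = {s.name() for s in wn.all_synsets('n')}
missing = noun_synsets - set(G.nodes())
print(len(missing))  # 0 if the adjlist is complete
```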
Oh, thanks! I'll convert it and run node2vec again.
@alexteua I still have the same problem. When I convert the adjlist to an edgelist and apply node2vec, no embedding for the synset "koran" is produced...
here is a sample of the produced embeddings by node2vec: https://drive.google.com/open?id=1E9MwpHMKcQK7nGE22fvw9kK9C1qqIQlk
@snkohail I guess the unconnected vertices are lost during the conversion from adjlist to edgelist (did you check the edgelist you got for the "koran" synset?). Also, can the node2vec implementation you are using consume the adjlist format? Can you send me a link to that implementation?
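A minimal check along these lines (the file name wordnet.edgelist is an assumption based on this thread):

```python
# Scan an edgelist file for any line mentioning the given node.
def node_in_edgelist(path, node):
    with open(path) as f:
        return any(node in line.split() for line in f)

print(node_in_edgelist('wordnet.edgelist', 'koran.n.01'))
```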
@akutuzov do you know how to preserve all vertices in the edge list format? Could making a connection from the synset to itself be a solution? Like ..... koran.n.01 koran.n.01 .....
Yes, I think that self-connections are a good choice here.
@snkohail @alexteua Yes, I suppose that nodes with no edges are simply lost during the conversion to edgelist. As far as I can see, node2vec accepts only an edgelist as input. So you can indeed just add self-connections to such nodes and then make sure that they do appear in the final edgelist.
@snkohail Have you managed to include all 82,115 noun synsets in your edge list?
@akutuzov, I noticed that 7542 synsets were missing from the embeddings list (all with degree 0), so I added a self-connection for each of them. I also checked that the adjlist and the edgelist (before and after conversion) contain the same number of unique nodes. Now I am running node2vec on the new converted edgelist. Thanks all for your comments!
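A sketch of this fix (file names are illustrative and the degree-0 filter follows the discussion above; this is not a verified script from the repo):

```python
# Add a self-loop to every isolated node so the edgelist keeps all synsets.
import networkx as nx

G = nx.read_adjlist('wordnet.adjlist')
isolated = [n for n in G.nodes() if G.degree(n) == 0]
print(len(isolated))  # 7542 in this thread
G.add_edges_from((n, n) for n in isolated)  # e.g. "koran.n.01 koran.n.01"
nx.write_edgelist(G, 'wordnet.selfloops.edgelist', data=False)

# Sanity check: the round-tripped edgelist now covers every node.
H = nx.read_edgelist('wordnet.selfloops.edgelist')
assert set(H.nodes()) == set(G.nodes())
```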
Great, looking forward to seeing the resulting scores!
Hi @snkohail Any news about the node2vec embeddings?
@akutuzov the results for max_jcn_brown_human are in the sheet.
Thanks @snkohail! Why is there NaN for some models?
There was a segmentation fault when running some models; I have fixed it already. I will upload the node2vec embeddings (with nodes converted back to their original names) to my Drive and post a link. Meanwhile, I am running the evaluation again.
Yes, can you please upload the models with 25 walks, context size 10, and 10 epochs somewhere, for the following dimensions: 50, 100, 200, 300?
If you can also train models with dimensionality 600, that would be great, but first we need the models mentioned above. Thanks!
requested models: https://drive.google.com/open?id=1V8V5_yiQrleugQQo6grr2qVhj5nfKXmA
It seems that using a directed graph is useless, right? How did you define the direction of the edges, by the way?
Yes, I noticed that as well. Directedness is a parameter that you can enable/disable in node2vec; I think it takes the edge direction from the order of node pairs in wordnet.edgelist. I just used the file resulting from the adjlist conversion.
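For illustration, this is how the same adjlist can be read as undirected vs. directed with networkx (in the directed case the edge direction follows the node order in the file; file names are examples):

```python
import networkx as nx

# Undirected (the default) vs. directed reading of the same adjlist.
G = nx.read_adjlist('wordnet.adjlist')
DG = nx.read_adjlist('wordnet.adjlist', create_using=nx.DiGraph())
nx.write_edgelist(DG, 'wordnet.directed.edgelist', data=False)
```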
@akutuzov same for DeepWalk, btw
For some experiments (see the yellow records in the sheet), I get this error:
2018-05-21 23:32:00,787 : INFO : Pairs with unknown words ratio: 0.0%
2018-05-21 23:32:04,702 : INFO : loading projection weights from converted_emb/converted_wordnet.300.25.10.5.T.emb
2018-05-21 23:32:25,762 : INFO : loaded (82115, 300) matrix from converted_emb/converted_wordnet.300.25.10.5.T.emb
2018-05-21 23:32:25,762 : INFO : precomputing L2-norms of word weight vectors
/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py:1045: RuntimeWarning: overflow encountered in square
self.vectors[i, :] /= sqrt((self.vectors[i, :] ** 2).sum(-1))
/usr/local/lib/python3.5/dist-packages/gensim/models/keyedvectors.py:1045: RuntimeWarning: invalid value encountered in true_divide
self.vectors[i, :] /= sqrt((self.vectors[i, :] ** 2).sum(-1))
/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py:32: RuntimeWarning: overflow encountered in reduce
return umr_sum(a, axis, dtype, out, keepdims)
Traceback (most recent call last):
  File "evaluation.py", line 18, in <module>
I saw it as well, and it happens only with models trained on directed graphs. Probably some vectors were left uninitialized, or something like that. As these models are inferior anyway, I think we can just ignore this.
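A hedged way to inspect such a model for the suspect vectors (assumes gensim 3.x, whose KeyedVectors exposes the .vectors array referenced in the warnings above):

```python
# Count vectors that are non-finite or large enough to overflow in float32
# when squared (|x|**2 exceeds float32 max once |x| > ~1.8e19).
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'converted_emb/converted_wordnet.300.25.10.5.T.emb')
vec = model.vectors.astype(np.float64)

nonfinite = np.flatnonzero(~np.isfinite(vec).all(axis=1))
huge = np.flatnonzero(np.abs(vec).max(axis=1) > 1.8e19)
print(len(nonfinite), 'non-finite vectors;', len(huge), 'overflow-prone vectors')
```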
@snkohail Can you please have a look at the paper draft and add a very brief description of how the node2vec models were trained? This should go in subsection 5.2, right after the paragraph about DeepWalk authored by @alexteua. Follow more or less the same description format, and make sure to include all non-obvious decisions you had to make during training. This can be done tomorrow, but preferably in the first half of the day.
@snkohail and can you also test window size 70, as in the Deepwalk experiments?
Yes, I am running the experiments with 600 dimensions now, and I also added context size 70. I'll update you once they are done.
I added some results for the 600-dimensional embeddings and others with context size 70. These are the files I have so far. Running node2vec takes forever with higher dimensionality and larger context sizes. If you want a specific model, please let me know.
Thanks! Yes, I need to download the context-70 models (with vector sizes 100, 200, 300 and 600), if possible.
@snkohail do we have the models?
@akutuzov do you have access to our GPU server? I can't download the files via scp: it takes forever, and the VPN keeps disconnecting.
@snkohail as far as I understand, http://ltdata1.informatik.uni-hamburg.de/shortest_path/ is available from outside, you can put the models there.
@akutuzov, I moved some models there. Sorry, the 600-dimensional model with 5 epochs is still running.
Thanks! For the future, it's better to upload gzipped versions of the models.
It seems that the window 70 models do not change the overall picture: they are somewhat better at lower dimensionalities but much worse at higher ones. So I think we will stick to the current hyperparameters in the paper, but still keep these new models and maybe use them in the camera-ready version (if accepted).
I moved the 600.10.70.5.F model.
Download the WordNet graph here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/graph/
Convert the adjlist file to an edgelist file (which node2vec takes as input) using networkX; see the sketch below.
read: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.readwrite.adjlist.read_adjlist.html#networkx.readwrite.adjlist.read_adjlist
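A minimal conversion sketch (file names as used elsewhere in this thread):

```python
import networkx as nx

G = nx.read_adjlist('wordnet.adjlist')
nx.write_edgelist(G, 'wordnet.edgelist', data=False)
# Caveat (see the discussion above): isolated nodes are kept in G but
# produce no lines in the edgelist, so add self-loops first if needed.
```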
Train node2vec models over the following hyperparameter grid:
- Number of dimensions: 50, 100, 200, 300
- Number of walks per source: 5, 10, 25
- Context size for optimization: 5, 10, 25
- Number of epochs in SGD: 1, 5, 10
- Graph is directed: true, false
Overall, you need to output 4 × 3 × 3 × 3 × 2 = 216 models (the name of each embedding file should contain its hyperparameters).
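For illustration, enumerating this grid in Python (the order of fields in the file name is inferred from names like converted_wordnet.100.5.5.1.T.emb seen above, so treat it as an assumption):

```python
# Enumerate all hyperparameter combinations and derive embedding file names.
from itertools import product

dims = [50, 100, 200, 300]
walks = [5, 10, 25]
contexts = [5, 10, 25]
epochs = [1, 5, 10]
directed = [True, False]

configs = list(product(dims, walks, contexts, epochs, directed))
print(len(configs))  # 4 * 3 * 3 * 3 * 2 = 216

for d, w, c, e, dr in configs:
    name = 'wordnet.{}.{}.{}.{}.{}.emb'.format(d, w, c, e, 'T' if dr else 'F')
```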
You can do it using this script: https://github.com/uhh-lt/shortpath2vec/blob/master/deepwalk/convert_embedding.py
OR manually, like ...
Perform the evaluation of each model using the evaluation script. See details here: https://github.com/uhh-lt/shortpath2vec.
Save the results in the 'node2vec' sheet of this table (similarly to the deepwalk method): https://docs.google.com/spreadsheets/d/1KjNns16ld3pVUY1K7aA0HH9Lrb1PiKBH7-ZIbWrD9Kc/edit#gid=1803816318