Closed: mbarbouch closed this issue 4 years ago.
Hi,
If I understood you correctly, no changes in the code are needed. You just have to prepare a version of the training dataset with whatever training pairs and distances you need.
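For concreteness, here is a minimal sketch (not one of the repository's own scripts) of how such a pairs file could be produced with NLTK, assuming the same whitespace-separated "synset1 synset2 similarity" format as the published datasets; my_shp_train.tsv is a hypothetical file name:

```python
from nltk.corpus import wordnet as wn

# Sketch: write "synset1 synset2 similarity" triples, mirroring the format
# of the published datasets. The full noun-noun matrix is huge, so this
# only covers a small slice for illustration.
nouns = list(wn.all_synsets('n'))[:200]
with open('my_shp_train.tsv', 'w') as out:
    for i, s1 in enumerate(nouns):
        for s2 in nouns[i + 1:]:
            sim = s1.path_similarity(s2)  # shortest-path similarity
            if sim is not None:
                out.write('{}\t{}\t{}\n'.format(s1.name(), s2.name(), sim))
```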
Thanks for your quick response. Alright, so you mean I have to prepare a dataset like the ones at https://ltnas1.informatik.uni-hamburg.de:8081/owncloud/index.php/s/lhcJQNxaGBLjL8o?path=%2Fdatasets, then run embeddings.py?
As the Shortest Path approach achieved the best results, I want to prepare a dataset for that measure. I guess I have to follow these steps: (1) extract the list of synsets (the vocabulary), (2) build a matrix of synset pairs, and (3) compute the pairwise similarities.
Is this correct?
Yes, you are right, you should use these scripts. Item 2 (build a matrix of pair synsets) is not actually needed; compute_paths_neighb.py will do that for you.
Thank you. It's already running... (Indeed, the nested for loop over the found synsets is effectively doing the 'matrix' work.)
I also noticed that step 1 is not needed for computing the similarities; it is already included in compute_paths_neighb.py. However, I think step 1 is still useful, as it will help embeddings.py speed up its training by not having to figure out the vocabulary list itself.
Furthermore, I saw walk_rank = 2 # Order of graph neighbors to consider in compute_paths_neighb.py. Is this the same value used for obtaining the results presented in the paper?
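To illustrate what an order-2 neighborhood means here, a small sketch, assuming graph edges are hypernym/hyponym links (the script's actual edge set may differ):

```python
from nltk.corpus import wordnet as wn

def neighbors(s):
    # Assumption: immediate graph neighbors are hypernyms and hyponyms.
    return set(s.hypernyms()) | set(s.hyponyms())

def neighborhood(s, walk_rank=2):
    # Collect all synsets reachable within walk_rank edges of s.
    seen, frontier = {s}, {s}
    for _ in range(walk_rank):
        frontier = {n for f in frontier for n in neighbors(f)} - seen
        seen |= frontier
    return seen - {s}

print(len(neighborhood(wn.synset('dog.n.01'), walk_rank=2)))
```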
Yes, we used 2nd order graph neighbors. More details are in this paper, section 7.
Great. Then I leave the values with their defaults.
I still have one question about WSD. I think it is better to open another thread, as it is a different topic...
In the end I also ran the full compute_paths.py. After exporting the similarities, I compared them to those provided in the ShP trainset at: https://ltnas1.informatik.uni-hamburg.de:8081/owncloud/index.php/s/lhcJQNxaGBLjL8o?path=%2Fdatasets.
I've seen a lot of differences, like:
entity.n.01 group.n.01 0.5833333333333333 vs. entity.n.01 group.n.01 0.3333333333333333 (using wn.synset('group.n.01').path_similarity(wn.synset('entity.n.01')))
contact.n.02 plunk.n.02 0.37499999999999994 vs. contact.n.02 plunk.n.02 0.25
(In addition, not all noun synset pairs provided in the ShP dataset are present in the export I got (and vice versa).)
Any explanations for this?
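The two pairs above can be re-checked directly against the current NLTK WordNet data:

```python
from nltk.corpus import wordnet as wn

print(wn.synset('group.n.01').path_similarity(wn.synset('entity.n.01')))
# -> 0.3333..., matching the freshly computed value, not the published one
print(wn.synset('contact.n.02').path_similarity(wn.synset('plunk.n.02')))
# -> 0.25, as reported above
```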
This is weird. I agree that some distances in the published ShP dataset differ from what NLTK WordNet yields now. My only guess at the moment is that some changes were introduced in WordNet itself in the two years since we created these datasets. This would also explain why you get a different set of synset pairs (the immediate neighbors have changed).
Anyway, using compute_paths_neighb.py you can easily re-create the datasets in a matter of minutes.
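One quick sanity check is the WordNet version bundled with your NLTK installation:

```python
from nltk.corpus import wordnet as wn
print(wn.get_version())  # e.g. '3.0'; a differing database would explain the mismatches
```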
That could be a reason indeed. I was more thinking of values getting mixed up between the parallel processes in which pairwise similarities are calculated, but I think my guess is not true: I verified some mismatches, and it turned out that the values are correct when executing WordNet's path_similarity for individual cases.
The reason I switched to compute_paths.py is that I had some mismatches in the compute_paths_neighb.py run. It took a long time to extract the final pruned file... However, that didn't make much difference for the 'noun' part. I did get a bigger file (~4.7 million entries vs. ~2.9 million when using compute_paths_neighb.py), which is good for the training, I think.
compute_paths_neighb.py has output a model of ~88k trained synsets. This is below the total number of synsets in WordNet, which is ~117k. At some point I wanted to experiment with only adjectives, but it turned out that the model doesn't contain them at all. The same applies to adverbs. The 88k consist of only nouns (~75k) and verbs (~13k).
I thought maybe this is due to neighbor pruning. So I ran compute_paths.py as well. Although the latter got stuck at some synset pairs, the vocab size was about 95k. So this is also <117k.
Is there something that I can set in order to take all synsets? How did you manage to cover all nouns in the published model?
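The per-POS synset counts can be verified directly in NLTK:

```python
from nltk.corpus import wordnet as wn

# Count synsets per part of speech in the NLTK WordNet data
for pos in ('n', 'v', 'a', 'r'):
    print(pos, sum(1 for _ in wn.all_synsets(pos)))
print('total', sum(1 for _ in wn.all_synsets()))
# The thread cites ~82k nouns and ~117k synsets in total.
```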
Did you change this line (telling the script to use only noun synsets)?
That is exactly the line I've changed to take all parts-of-speech, but the number of nouns in the model was about 75k, below the total number of nouns (82k). So the expectation is that, when considering only nouns, the model should cover all nouns?
I can do that for nouns, but my aim is to have verbs, adverbs, and adjectives as well; all synsets, actually. It seems a bit weird that, when doing so, the number of synsets per part-of-speech becomes lower. I'll see if I can find something else...
7,448 noun synsets in WordNet do not have any neighbors (no edges are attached to these nodes in the graph). This is why you get only about 75k noun synsets in the resulting model.
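A rough way to reproduce this count, assuming "neighbors" means hypernym/hyponym edges (the script's actual edge definition may differ, so the number may not match 7,448 exactly):

```python
from nltk.corpus import wordnet as wn

# Noun synsets with neither hypernym nor hyponym edges attached
isolated = [s for s in wn.all_synsets('n')
            if not s.hypernyms() and not s.hyponyms()]
print(len(isolated))
```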
Aha, so that explains the lower size. (My guess was also that this happened during neighbor pruning, because two synsets are too far away from each other, or there is no path between them at all.) However, the published model on the first page, http://ltdata1.informatik.uni-hamburg.de/path2vec/embeddings/shp_embeddings.vec.gz, does contain all 82k noun synsets! So I was wondering how you could cover all of them, even when some do not have any neighbors?
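The published model's vocabulary can be inspected with gensim (a sketch; key_to_index is the gensim 4.x attribute, older versions use .vocab):

```python
from gensim.models import KeyedVectors

# shp_embeddings.vec.gz is the published model linked above
model = KeyedVectors.load_word2vec_format('shp_embeddings.vec.gz',
                                          binary=False)
print(len(model.key_to_index))          # total vocabulary size
print('dog.n.01' in model.key_to_index)  # check an individual synset
```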
The model contains all the synsets because it is initialized with the vocabulary containing all the synsets.
It is just that there's no training data for the synsets without immediate neighbors, so their embeddings will remain as they were after the initialization stage (being essentially random).
NB: in fact, even if a noun synset does not have any immediate neighbors, WordNet still considers it to be a child of the entity.n.01 root synset. This relation can be extracted in NLTK using the root_hypernyms() method, but we do not use it in compute_paths_neighb.py.
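For reference, root_hypernyms() in action:

```python
from nltk.corpus import wordnet as wn
print(wn.synset('dog.n.01').root_hypernyms())  # [Synset('entity.n.01')]
```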
Ah, I see, so the synsets for which there is no path just keep their randomly initialized embeddings(?).
NB: in fact, even if a noun synset does not have any immediate neighbors, WordNet still considers it to be a child of the entity.n.01 root synset.
True. I just don't think it is useful, as it won't add any information to any concrete synset; in the end, everything can be thought of as an entity. In my case I need the distinctive semantic information in order to extend an existing model that relies on implicit language knowledge absorbed through unsupervised training.
Anyhow, I really appreciate your active involvement and helpful responses. Thank you.
Yes, if we have no training pairs for a synset, then it just keeps its randomly initialized embedding.
Hello Andrey, Mohamed,
This is exactly what I want to do! I could see that Mohamed has already created embeddings for all the synsets in WordNet.
Would it be possible to give the steps to train the model? I guess it will be helpful for other people who want to use these embeddings for all synsets.
If that is not possible, can I request the embeddings you have trained over all the synsets present in WordNet?
Thanks! -Onkar
Hi,
First of all thanks for publishing this code. Second, I want to re-train the model on the entire WordNet, taking all parts-of-speech into account. Where do I need to make changes to accomplish this?
Thanks in advance.