uhh-lt / path2vec

Learning to represent shortest paths and other graph-based measures of node similarities with graph embeddings
Apache License 2.0

Training the model on the entire WordNet #27

Closed mbarbouch closed 4 years ago

mbarbouch commented 4 years ago

Hi,

First of all thanks for publishing this code. Second, I want to re-train the model on the entire WordNet, taking all parts-of-speech into account. Where do I need to make changes to accomplish this?

Thanks in advance.

akutuzov commented 4 years ago

Hi,

If I understood you correctly, no changes in the code are needed. You just have to prepare a version of the training dataset with whatever training pairs and distances you need.

mbarbouch commented 4 years ago

Hi,

> If I understood you correctly, no changes in the code are needed. You just have to prepare a version of the training dataset with whatever training pairs and distances you need.

Thanks for your quick response. Alright, so you mean I have to prepare a dataset like those at https://ltnas1.informatik.uni-hamburg.de:8081/owncloud/index.php/s/lhcJQNxaGBLjL8o?path=%2Fdatasets, and then run embeddings.py?

As the Shortest Path (ShP) approach achieved the best results, I want to prepare a dataset for that measure. I guess I have to follow these steps:

  1. Extract all vocabulary synsets from WordNet, using create_voc.py?
  2. Build a matrix of synset pairs.
  3. Compute ShP similarity for all synset pairs, using compute_paths_neighb.py?
  4. Prune the matrix to at most the 50 closest neighbors per synset, using prune_by_neighbors.py?
  5. Export the computed similarity values and use embeddings.py to train the embeddings?

Is this correct?

akutuzov commented 4 years ago

Yes, you are right, you should use these scripts. Item 2 (building the matrix of synset pairs) is not actually needed; compute_paths_neighb.py will do that for you.

mbarbouch commented 4 years ago

> Yes, you are right, you should use these scripts. Item 2 (build a matrix of pair synsets) is not actually needed, compute_paths_neighb.py will do that for you.

Thank you. It's already running... (Indeed, the nested for loop over the found synsets effectively builds the 'matrix'.)

I also noticed that step 1 is not needed for computing the similarities; it is already included in compute_paths_neighb.py. However, I think step 1 is still useful, as it lets embeddings.py skip building the vocabulary list itself and thus speeds up training.

Furthermore, I saw `walk_rank = 2  # Order of graph neighbors to consider` in compute_paths_neighb.py. Is this the same value used to obtain the results presented in the paper?

akutuzov commented 4 years ago

Yes, we used 2nd order graph neighbors. More details are in this paper, section 7.

mbarbouch commented 4 years ago

> Yes, we used 2nd order graph neighbors. More details are in this paper, section 7.

Great. Then I'll leave the values at their defaults.

mbarbouch commented 4 years ago

I still have one question about WSD. I think it is better to open another thread, as it is a different topic...

mbarbouch commented 4 years ago

In the end I also ran the full compute_paths.py. After exporting the similarities, I compared them to those provided in the ShP training set at https://ltnas1.informatik.uni-hamburg.de:8081/owncloud/index.php/s/lhcJQNxaGBLjL8o?path=%2Fdatasets.

I've seen a lot of differences between the two. (In addition, not all noun synset pairs provided in the ShP dataset are present in the export I got, and vice versa.)

Any explanations for this?

akutuzov commented 4 years ago

This is weird. I agree that some distances in the published ShP dataset are different from what NLTK WordNet yields now. My only guess at the moment is that some changes were introduced in WordNet itself over the two years since we created these datasets. This would also explain why you get a different set of synset pairs (the immediate neighbors have changed).

Anyway, using compute_paths_neighb.py you can easily re-create the datasets in a matter of minutes.

mbarbouch commented 4 years ago

That could indeed be the reason. At first I suspected that values got mixed up between the parallel processes in which the pairwise similarities are calculated, but that guess seems wrong: I verified some of the mismatches, and running WordNet's path_similarity on the individual cases showed that my exported values are correct.

The reason I switched to compute_paths.py is that I had some mismatches in the compute_paths_neighb.py run, and it took a long time to produce the final pruned file. For the 'noun' part that didn't make much difference, though I did get a bigger file (~4.7M pairs vs. ~2.9M with compute_paths_neighb.py), which should be good for training, I think.

mbarbouch commented 4 years ago

compute_paths_neighb.py produced a model of ~88k trained synsets, which is below the total number of synsets in WordNet (~117k). At some point I wanted to experiment with only adjectives, but it turned out that the model doesn't contain them at all, and the same applies to adverbs. The 88k consists only of nouns (~75k) and verbs (~13k).

I thought this might be due to neighbor pruning, so I ran compute_paths.py as well. Although the latter got stuck on some synset pairs, the vocabulary size was about 95k, so still below 117k.

Is there something I can set in order to include all synsets? How did you manage to cover all nouns in the published model?

akutuzov commented 4 years ago

Did you change this line (telling the script to use only noun synsets)?

mbarbouch commented 4 years ago

> Did you change this line (telling the script to use only noun synsets)?

That is exactly the line I changed to take all parts of speech, but the number of nouns in the model was about 75k, below the total number of nouns (82k). So is the expectation that, when considering only nouns, the model should cover all of them?

I can do that for nouns, but my aim is to have verbs, adverbs and adjectives as well; all synsets, actually. It seems a bit odd that, when doing so, the number of synsets per part of speech becomes lower. I'll see if I can find something else...

akutuzov commented 4 years ago

7,448 noun synsets in WordNet do not have any neighbors (no edges are attached to these nodes in the graph). This is why you get only about 75k noun synsets in the resulting model.

mbarbouch commented 4 years ago

> 7,448 noun synsets in WordNet do not have any neighbors (no edges are attached to these nodes in the graph). This is why you get only about 75k noun synsets in the resulting model.

Aha, so that explains the smaller size. (My guess was also that this happened during neighbor pruning, because two synsets are too far away from each other, or there is no path between them at all.) However, the published model on the first page, http://ltdata1.informatik.uni-hamburg.de/path2vec/embeddings/shp_embeddings.vec.gz, does contain all 82k noun synsets! So I was wondering how you could cover all of them, even when some have no neighbors?

akutuzov commented 4 years ago

The model contains all the synsets because it is initialized with a vocabulary containing all of them. There is simply no training data for the synsets without immediate neighbors, so their embeddings remain as they were after the initialization stage (essentially random). NB: in fact, even if a noun synset does not have any immediate neighbors, WordNet still considers it to be a child of the entity.n.01 root synset. This relation can be extracted in NLTK using the root_hypernyms() method, but we do not use it in compute_paths_neighb.py.

mbarbouch commented 4 years ago

Ah, I see, so the synsets for which there is no path just keep their randomly initialized embeddings(?).

> NB: in fact, even if a noun synset does not have any immediate neighbors, WordNet still considers it to be a child of the entity.n.01 root synset.

True. I just don't think it is useful, as it won't add any distinguishing information to any concrete synset; in the end, everything can be thought of as an entity. In my case I need distinctive semantic information in order to extend an existing model that relies on implicit language knowledge absorbed through unsupervised training.

Anyhow, I really appreciate your active involvement and helpful responses. Thank you.

akutuzov commented 4 years ago

Yes, if we have no training pairs for a synset, then it just keeps its randomly initialized embedding.

oapandit commented 3 years ago

Hello Andrey, Mohamed,

This is exactly what I want to do! I can see that Mohamed has already created embeddings for all the synsets from WordNet.

Would it be possible to give the steps to train the model? I guess it would be helpful for other people who want to use these embeddings for all synsets.

If that is not possible, may I request the embeddings you have trained over all the synsets present in WordNet?

Thanks! -Onkar