uhh-lt / path2vec

Learning to represent shortest paths and other graph-based measures of node similarities with graph embeddings
Apache License 2.0

Evaluation based on word sense disambiguation #2

Closed alexanderpanchenko closed 6 years ago

alexanderpanchenko commented 6 years ago

Motivation

We plan to submit a paper to the EMNLP conference (http://emnlp2018.org) about the graph embedding approach implemented in this repository. In this experiment, we use our model to learn vector representations of WordNet nodes (synsets). The vectors are stored in the word2vec format.

One of the evaluation methods for the obtained embeddings is to apply them to word sense disambiguation (https://en.wikipedia.org/wiki/Word-sense_disambiguation). In particular, it is straightforward to do so with a pre-existing graph-based algorithm described in the following paper: http://www.cse.unt.edu/~tarau/research/misc/wsd/papers/sinha.ieee07.pdf

Unfortunately, the implementation of the algorithm is not available. The goal is to re-implement the algorithm presented in the paper and to reproduce the experiment reported there. After this, the goal is to replace the WordNet-based similarity metrics used in the paper with the graph embedding similarities and to show that the difference in the results is small.

Implementation

  1. First, read the paper carefully to understand the method that has to be implemented. Make sure you understand Algorithm 1.

  2. Implement the word sense disambiguation (WSD) algorithm presented as Algorithm 1 of the paper in Python 3. Use the NetworkX library to (a) represent the graph structure and (b) rely on its PageRank implementation: https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html (a minimal sketch of this step follows the list).

  3. For conducting the evaluation experiment on the SENSEVAL 2/3 datasets, rely on this framework: http://lcl.uniroma1.it/wsdeval/evaluation-data

  4. Report the results in a Google Sheets table. Compare them to the original scores from the paper (they should be reasonably close).

  5. Use the gensim library to load the graph embeddings of the synsets. Replace the JCN and LCH word similarity measures with their vectorized versions. Models in the word2vec format are available here: http://ltdata1.informatik.uni-hamburg.de/shortest_path/models/

  6. Make a version of the algorithm which uses not the original JCN / LCH measures but their vectorized counterparts.

  7. Report in the Google Sheet the performance of the WSD algorithm on the same dataset, based on the vector-based synset similarities.
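
A minimal sketch of step 2, assuming a simplified setting (nouns only, LCH similarity, one flat bag of words) rather than the full Algorithm 1; it only illustrates how NetworkX and its PageRank implementation fit together:

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def disambiguate_nouns(words):
    """Pick one WordNet sense per word via PageRank over a sense graph."""
    candidates = {w: wn.synsets(w, pos=wn.NOUN) for w in words}
    graph = nx.Graph()
    for w, senses in candidates.items():
        for s in senses:
            graph.add_node((w, s.name()))
    # connect senses of *different* words, weighted by LCH similarity
    word_list = list(candidates)
    for i, w1 in enumerate(word_list):
        for w2 in word_list[i + 1:]:
            for s1 in candidates[w1]:
                for s2 in candidates[w2]:
                    graph.add_edge((w1, s1.name()), (w2, s2.name()),
                                   weight=s1.lch_similarity(s2))
    ranks = nx.pagerank(graph, weight='weight')  # NetworkX PageRank
    # keep the highest-ranked sense for each word
    best = {}
    for (w, sense), score in ranks.items():
        if w not in best or score > best[w][1]:
            best[w] = (sense, score)
    return {w: sense for w, (sense, score) in best.items()}

print(disambiguate_nouns(['church', 'door', 'parishioner']))
```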

akutuzov commented 6 years ago

Another option for graph processing (in addition to networkx) is the Igraph library.

alexanderpanchenko commented 6 years ago

Does it have page rank?


akutuzov commented 6 years ago

Sure, http://igraph.org/python/doc/igraph.Graph-class.html#pagerank
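
A quick illustrative check (assuming the python-igraph package is installed):

```python
import igraph as ig

g = ig.Graph.Famous("Zachary")   # small built-in example graph
scores = g.pagerank()            # one PageRank score per vertex
print(len(scores), max(scores))
```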

akutuzov commented 6 years ago

The resulting embeddings should be evaluated either on the variants of SimLex with lemmas transformed into synsets (three datasets, with different WordNet similarity metrics used to find the correct synsets) or on the original SimLex using the special evaluation script which selects synsets dynamically.

See the details in the README for this repository.

akutuzov commented 6 years ago

Please use the following graph embeddings for WSD evaluation:

For JCN on SemCor:

For JCN on Brown Corpus:

For LCH:

alexanderpanchenko commented 6 years ago

https://radimrehurek.com/gensim/models/word2vec.html
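
For reference, loading one of the synset models in gensim could look roughly like this (the file name below is hypothetical; the actual models are listed at the ltdata1 link above, and the vocabulary entries are assumed to be WordNet synset names such as 'dog.n.01'):

```python
from gensim.models import KeyedVectors

# hypothetical file name; take a real one from
# http://ltdata1.informatik.uni-hamburg.de/shortest_path/models/
model = KeyedVectors.load_word2vec_format('jcn-semcor_embeddings.vec.gz',
                                           binary=False)

# the vectorized counterpart of a JCN/LCH similarity between two synsets
# is then just the cosine similarity between their vectors
sim = model.similarity('dog.n.01', 'cat.n.01')
print(sim)
```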

alexanderpanchenko commented 6 years ago

(image attachment)

akutuzov commented 6 years ago

@m-dorgham can you describe how to run your WSD evaluation script on arbitrary embeddings? And can you push the senseval2 dataset that you used to a separate subdirectory in this repository?

m-dorgham commented 6 years ago

@akutuzov I pushed the senseval datasets under 'data/senseval'. To use the code (v2), you just need to modify the file paths of the senseval XML file, the gold keys and the embedding file. You might also want to play with the flags USE_JCN and VECTORIZED_SIMILARITY to switch between JCN and LCH, and between using the embeddings or the original JCN/LCH measures.

akutuzov commented 6 years ago

OK, thanks!

akutuzov commented 6 years ago

@m-dorgham strangely, when I run your graph_wsd_test_v2.py, I get results different from those in the sheet. For instance, with non-vectorized LCH it is F-score=0.5342, not 0.5473. I also get 0.5263 for lch-thresh15-near50_embeddings_vsize300_bsize20_lr005 instead of 0.5403, and 0.5206 for lch-thresh15-near50_embeddings_vsize200_bsize100_lr005 instead of 0.5267. Any ideas why that is?

m-dorgham commented 6 years ago

@akutuzov This is weird! I don't know what the reason is. Maybe a difference in numerical precision between our environments? But that should not lead to such a big difference in scores.

Could you tell me your parameter configuration for the original LCH? Is it like the following?

USE_POS_INFO = True
USE_JCN = False  # if False, use LCH
VECTORIZED_SIMILARITY = False
USE_PAGERANK = False
AVG_METHOD = 'micro'
MAX_DEPTH = 3

My Python version is 3.6.5. What about yours?

Also my nltk version is 3.3, and networkx version is 2.1.

akutuzov commented 6 years ago

Yes, exactly this. The difference is only that I use Python 3.5. In fact, I now see that the results are non-deterministic (i.e., they vary from run to run on the same set of parameters). It seems that there are some probabilistic decisions either in the WSD algorithm or in the way you evaluate it. @alexanderpanchenko is it expected?

Anyway, can you please then run the evaluation, say, 5 times for each variant and report the average and standard deviation?

alexanderpanchenko commented 6 years ago

I am surprised by the randomness. Is it because of the centrality computation?


m-dorgham commented 6 years ago

Each time I run it I get the same results. And I don't think there is any random part in the algorithm; I think all the steps are deterministic.

m-dorgham commented 6 years ago

@alexanderpanchenko can you run it and tell us the results?

m-dorgham commented 6 years ago

@akutuzov What about nltk and networkx versions? my nltk version is 3.3, and networkx version is 2.1.

alexanderpanchenko commented 6 years ago

Sorry. Can't run it now.


m-dorgham commented 6 years ago

@akutuzov The only logical reason I can think of right now is that maybe our nltk installations are using different versions of WordNet (which would mean some differences in the synsets).

My WordNet version is 3.0.

You can get the version using the following:

from nltk.corpus import wordnet as wn
wn.get_version()

akutuzov commented 6 years ago

@m-dorgham as I've said, my software versions are exactly the same as yours (including Wordnet), except for Python itself. I tried to run it in Python 2.7, and the results are deterministic there (although still not the same as yours). Weird. Can it have something to do with the ordering of dictionary keys? Do you rely on it?

m-dorgham commented 6 years ago

@akutuzov Yes, I rely on it. That's why I'm using OrderedDict: to guarantee the insertion and retrieval order across different implementations of Python. But maybe it's not implemented this way in Python 3.5.

m-dorgham commented 6 years ago

Hey @akutuzov I modified the code of graph_wsd_test_v2.py and removed the use of OrderedDict. Please try now and tell me the result.

akutuzov commented 6 years ago

Hi @m-dorgham No, it still produces non-deterministic results with Python 3.5, checked on two machines with different Linux distributions. This is strange; I'm looking into it. Are you sure you refactored all the places where you rely on dict key ordering?

NB: I polished the WSD script a bit in the last two commits. Most importantly, it now accepts the model file name as a command-line argument. It also looks for the senseval2 data in the directories within the repository.

akutuzov commented 6 years ago

It seems that this condition fluctuates in Python 3.5. On the same test data this intersection can contain 1 or 0 elements, depending on the run. This is the source of different scores.

The gold keys set is always stable, it is the predicted keys set which is different from run to run. For example, in one run it can be ['bell_ringer%1:18:01::', 'ringer%1:18:02::', 'toller%1:18:01::'] but in another it is ['clone%1:18:00::', 'dead_ringer%1:18:00::', 'ringer%1:18:01::']

Thus, there is some randomness in how you create the disambiguated data.

m-dorgham commented 6 years ago

Hi @akutuzov It's really weird; I will try to figure it out. But just in case we can't figure it out, could you please install Python 3.6 and test the script on it? I think you can install v3.6 alongside v3.5 while still having 3.5 as your default Python. If this is not possible, could you install it in a Docker container or a virtual environment? It seems that there may be some differences in implementation between 3.6 and 3.5. I don't recommend testing the script on Python 2.7, because there are definitely differences in the implementation of some functions (some of which I encountered while writing this script).

alexanderpanchenko commented 6 years ago

This is not really weird: they changed it in Python 3.6.

https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6
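
A toy illustration of the effect, reusing the sense keys from the earlier comment (the scores are made up): in CPython 3.5 a plain dict over string keys can iterate in a different order on every run because of hash randomization, while from 3.6 on plain dicts keep insertion order, and OrderedDict guarantees it on every version.

```python
from collections import OrderedDict

keys = ['bell_ringer%1:18:01::', 'ringer%1:18:02::', 'toller%1:18:01::']

plain = {k: 1.0 for k in keys}                 # equal scores: a tie
ordered = OrderedDict((k, 1.0) for k in keys)

# Any tie-break that takes "the first" key depends on iteration order:
# on Python 3.5 the plain dict's order can change between runs, so the
# predicted sense key fluctuates; the OrderedDict result is stable.
print(next(iter(plain)), next(iter(ordered)))
```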


akutuzov commented 6 years ago

Yes, of course I can install Python 3.6. But I would rather find out why the results are non-deterministic in 3.5, because otherwise it means that our scores rely on idiosyncrasies of the dictionary implementation in a particular Python version, which is not good. If the algorithm is deterministic, it should be implemented in a deterministic way, I think.

alexanderpanchenko commented 6 years ago

See the link above: I guess the order is guaranteed only in 3.6 (before that it was not...). Anyway, relying on this order is not a good thing to do :-)


akutuzov commented 6 years ago

@alexanderpanchenko Yes, I mentioned this 3.6 change before, but I think we shouldn't rely on this ordering (which is in fact guaranteed only from 3.7 on, and the 3.6 docs explicitly recommend not relying on it). @m-dorgham can you please find out where in your code you use dictionary key ordering and re-implement it without relying on that?

m-dorgham commented 6 years ago

I replaced the code that depends on the order of the dict elements, so this should not be the cause now. I guess there is some Python function that is implemented differently between 3.5 and 3.6.

akutuzov commented 6 years ago

@m-dorgham Are you sure you do not rely on dict ordering in some other parts of your code? It is dangerous to use scores which vary from one Python version to another.

m-dorgham commented 6 years ago

@akutuzov Try the script now, please. I replaced all dicts with OrderedDict to guarantee the ordered behavior. Tell me what the result is now.

akutuzov commented 6 years ago

Yes, now it seems deterministic (at least, across several consecutive runs). Thanks @m-dorgham! Can you please check that all the scores in the sheet are still valid with this new code?

m-dorgham commented 6 years ago

@akutuzov I pulled the latest code and reran the experiments, and all the scores are the same as in the sheet. What about your scores, are they still different?

akutuzov commented 6 years ago

I checked only a couple of configurations and the scores there are the same as yours. So I would consider this issue closed, although it is still strange that dictionary ordering influences the evaluation results.

m-dorgham commented 6 years ago

That's great, finally we have the same scores :D The influence of the ordering is really weird because the parts I modified to OrderedDict should not depend on the ordering at all!

@akutuzov Please tell me if you got better scores with other models of the embeddings.

akutuzov commented 6 years ago

I tested a bunch of other models and added a couple of rows to the sheet, but no surprises there. All the results can be found in https://github.com/uhh-lt/shortpath2vec/tree/master/wsd (I moved all the WSD stuff to this directory).

@alexanderpanchenko I think we can close this issue now.

alexanderpanchenko commented 6 years ago

Great job, guys!!!

akutuzov commented 6 years ago

Hi @m-dorgham Can you also have a look at the paper draft and add a short paragraph about the nature of the Sinha & Mihalcea WSD algorithm? This should go into Section 4.1; just replace the words 'Here goes Mohammad's text about this WSD algorithm' with your paragraph.

P.S. Any other comments or fixes for the paper text are also welcome!

m-dorgham commented 6 years ago

Hello @akutuzov, I wrote the paragraph; please review it and see if the length/content is fine or needs any modification. Feel free to modify it if needed.

I fixed one spelling mistake at the end of page 1, where it was written 'we use an an input'; I changed the first 'an' to 'as'. I will read the paper again and tell you if I have any other comments.

alexanderpanchenko commented 6 years ago

@m-dorgham @akutuzov: please include the plot @m-dorgham prepared into Section 4.1 and describe it briefly: it helps understanding a lot.

m-dorgham commented 6 years ago

@alexanderpanchenko I added the plot, please have a look.

akutuzov commented 6 years ago

@m-dorgham what was the original sentence from which the plot was produced?

akutuzov commented 6 years ago

Thanks for the paragraph, I think it is good to go (I rephrased it in some places and commented out one sentence). But I think it would be better to provide the original sentence for the plot.

m-dorgham commented 6 years ago

The whole sentence is "The parishioners of St. Michael and All Angels stop to chat at the church door, as members here always have." But the plot is only a subgraph for the target words (the words with labels): parishioners, stop, chat, church, door and members. I will add it to the figure.

m-dorgham commented 6 years ago

@akutuzov done adding the sentence to the paragraph and the figure, please review.

akutuzov commented 6 years ago

By the way, @m-dorgham can you suggest any way to determine the statistical significance of the differences between the scores of the models in this WSD task?

m-dorgham commented 6 years ago

@akutuzov sorry Andrey, I forgot the methods, it's been a long time since I studied this stuff.
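
For the record, one common option for this kind of paired comparison (two systems scored on the same test instances) would be McNemar's test over per-instance correctness; a rough sketch with made-up data, not something used in the experiments above:

```python
from scipy.stats import chi2

def mcnemar_p(correct_a, correct_b):
    """correct_a / correct_b: per-instance correctness (booleans) of two WSD systems."""
    b = sum(x and not y for x, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(y and not x for x, y in zip(correct_a, correct_b))  # B right, A wrong
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-square with continuity correction
    return chi2.sf(stat, df=1)               # p-value, 1 degree of freedom

# toy correctness vectors
print(mcnemar_p([True, True, False, True], [True, False, False, True]))
```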