luofuli / word-sense-disambiguation

Incorporating Dictionary Knowledge into Neural Word Sense Disambiguation(ACL 2018)
MIT License

full set of word senses missing in dictionary files? #2

Open yakazimir opened 4 years ago

yakazimir commented 4 years ago

I'm trying to rebuild your data and noticed that ALL.dict.xml (which, as I understand it, contains all of the lemmas, glosses, and word senses used across the SemEval data) has entries such as the following:

<lexelt item="climate#n" pos="n" sence_count_wn="2" sense_count_corpus="1" word_example_count="5">
 <sense gloss="the weather in some location averaged over some long period of time" id="climate%1:26:00::" sense_example_count="5" sense_freq="5" synset="climate clime">
 </sense>
</lexelt>

Here climate#n is the lemma and part of speech. The entry says sence_count_wn=2, yet there is only one sense inside the lexelt. Shouldn't both sense entries be listed there? My assumption is that each lexelt should contain every WordNet sense and gloss for the lemma given in item.
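To see how widespread this is, a check along the following lines could compare the declared WordNet sense count with the number of sense elements actually listed. This is only a minimal sketch: the `<dict>` wrapper element and the helper name `find_incomplete_entries` are assumptions for illustration, and the attribute names (including the `sence_count_wn` spelling) are copied from the entry above.

```python
import xml.etree.ElementTree as ET

# One entry copied from the issue. The <dict> root is an assumed wrapper so the
# snippet parses on its own; the real file's root element may be named differently.
SAMPLE = """
<dict>
<lexelt item="climate#n" pos="n" sence_count_wn="2" sense_count_corpus="1" word_example_count="5">
 <sense gloss="the weather in some location averaged over some long period of time" id="climate%1:26:00::" sense_example_count="5" sense_freq="5" synset="climate clime">
 </sense>
</lexelt>
</dict>
""".strip()


def find_incomplete_entries(root):
    """Return (item, declared, listed) for each lexelt listing fewer senses than declared."""
    mismatches = []
    for lexelt in root.iter("lexelt"):
        declared = int(lexelt.get("sence_count_wn", 0))  # attribute spelled as in the file
        listed = len(lexelt.findall("sense"))            # direct <sense> children only
        if listed < declared:
            mismatches.append((lexelt.get("item"), declared, listed))
    return mismatches


root = ET.fromstring(SAMPLE)
mismatches = find_incomplete_entries(root)
for item, declared, listed in mismatches:
    print(f"{item}: {listed} of {declared} WordNet senses listed")
```

For the full file, `ET.parse("ALL.dict.xml").getroot()` could replace the `ET.fromstring` call, assuming the file is well-formed XML.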

I also noticed that when I look up the sense key climate%1:26:00:: in NLTK's WordNet interface (which I see you also use), I get a different definition:

In [1]: from nltk.corpus import wordnet as wn
In [2]: wn.synset_from_sense_key('climate%1:26:00::').definition()
Out[2]: 'the prevailing psychological state'

## whereas your sense gloss seems to correspond to climate%1:26:01::
In [11]: wn.synset_from_sense_key('climate%1:26:01::').definition()                        
Out[11]: 'the weather in some location averaged over some long period of time'

In [13]: wn.get_version()                                                                   
Out[13]: '3.0'

yakazimir commented 4 years ago

Just an update: something seems to be wrong with NLTK's sense-key lookup via synset_from_sense_key.

Your gloss/id pair is consistent with WordNet itself, as I found when searching here: http://wordnetweb.princeton.edu/perl/webwn?s=climate&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=1&o3=&o4=&h=0000 .

This issue is mentioned here: https://github.com/nltk/nltk/issues/1934
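For anyone comparing the two keys discussed earlier in this thread, it may help to split them into their fields. The sketch below follows the sense-key layout described in the WordNet documentation (lemma%ss_type:lex_filenum:lex_id:head_word:head_id). The names `SenseKey`, `parse_sense_key`, and `differing_fields` are hypothetical helpers, not part of NLTK or this repository.

```python
from typing import List, NamedTuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int      # 1=noun, 2=verb, 3=adjective, 4=adverb, 5=adjective satellite
    lex_filenum: int  # lexicographer file number
    lex_id: int       # distinguishes senses of a lemma within one lexicographer file
    head_word: str    # empty unless the sense is an adjective satellite
    head_id: str      # empty unless the sense is an adjective satellite


def parse_sense_key(key: str) -> SenseKey:
    """Split a WordNet sense key into its fields (hypothetical helper)."""
    lemma, _, rest = key.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id), head_word, head_id)


def differing_fields(a: SenseKey, b: SenseKey) -> List[str]:
    """Names of the fields in which two parsed sense keys differ."""
    return [name for name in a._fields if getattr(a, name) != getattr(b, name)]


key_a = parse_sense_key("climate%1:26:00::")
key_b = parse_sense_key("climate%1:26:01::")
diff = differing_fields(key_a, key_b)
print(diff)
```

The two keys differ only in lex_id, so a lookup that resolves that field incorrectly would be enough to return the other climate synset.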