nltk / nltk_data

NLTK Data

WordNet key ERROR on ADJ for Resnik Similarity #185

Closed scramblingbalam closed 2 years ago

scramblingbalam commented 2 years ago

I'm trying to find the Resnik similarity of every word in a sentence to build a sentence-similarity measure (as part of work on crowdsourcing narrative intelligence), and Resnik similarity fails on adjectives.

```python
import itertools
import nltk
from nltk.corpus import wordnet as wn

nltk.download()

from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

from nltk.corpus import genesis

genesis_ic = wn.ic(genesis, False, 0.0)

print(wn.synsets("daily"))
synsetsA = wn.synsets("daily", pos=wn.ADJ)
synsetsB = wn.synsets("daily", pos=wn.ADJ)
print(synsetsA)
print(max(i[0].res_similarity(i[1], genesis_ic)
          for i in itertools.product(synsetsA, synsetsB)))
```

It throws a key error from WordNet: `WordNetError: Information content file has no entries for part-of-speech: s`

The full traceback is:

```
runfile('C:/Users/ik211f/Documents/python/wordnetERROR.py', wdir='C:/Users/ik211f/Documents/python')
[Synset('daily.n.01'), Synset('daily.s.01'), Synset('casual.s.03'), Synset('daily.r.01'), Synset('day_by_day.r.01')]
[Synset('daily.s.01'), Synset('casual.s.03')]
Traceback (most recent call last):

  File "C:\Users\ik211f\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\corpus\reader\wordnet.py", line 2382, in information_content
    icpos = ic[synset._pos]

KeyError: 's'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\ik211f\Documents\python\wordnetERROR.py", line 20, in <module>
    print(max(list(i[0].res_similarity(i[1],genesis_ic) for i in itertools.product(synsetsA,synsetsB))))

  File "C:\Users\ik211f\Documents\python\wordnetERROR.py", line 20, in <genexpr>
    print(max(list(i[0].res_similarity(i[1],genesis_ic) for i in itertools.product(synsetsA,synsetsB))))

  File "C:\Users\ik211f\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1014, in res_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)

  File "C:\Users\ik211f\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\corpus\reader\wordnet.py", line 2363, in _lcs_ic
    ic1 = information_content(synset1, ic)

  File "C:\Users\ik211f\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\corpus\reader\wordnet.py", line 2385, in information_content
    raise WordNetError(msg % synset._pos) from e

WordNetError: Information content file has no entries for part-of-speech: s
```

The part of speech denoted by `'s'` is not documented on the WordNet HOWTO page, and neither is the inability to compute similarities for some adjectives (the adjective satellites). Since the problem seems to lie with adjective satellites, this error may be related to #2442. I'm not great at reading module code, but it looks like the case should be handled the way wordnet.py lines 2101 to 2104 already do:

```python
for ss in possible_synsets:
    pos = ss._pos
    if pos == ADJ_SAT:
        pos = ADJ
```

based on the variable assignment at line 68:

```python
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"
```
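A minimal sketch of the remapping being proposed (`lookup_ic` is a hypothetical helper, not the actual nltk code path): before indexing the IC dictionary, an `'s'` synset's POS would be rewritten to `'a'`, mirroring the lines quoted above, instead of raising `KeyError: 's'`:

```python
# Constants as assigned at wordnet.py line 68:
ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v"

def lookup_ic(ic, pos, offset):
    # Hypothetical helper sketching the proposed fix: remap adjective
    # satellites ('s') to plain adjectives ('a') before indexing the IC
    # table, rather than letting ic[pos] raise KeyError as
    # information_content() currently does.
    if pos == ADJ_SAT:
        pos = ADJ
    return ic[pos].get(offset, 0.0)

# Toy IC table with an 'a' entry only, keyed by a made-up synset offset:
toy_ic = {ADJ: {12345: 7.5}}
print(lookup_ic(toy_ic, ADJ_SAT, 12345))  # → 7.5, satellite resolved via 'a'
```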

ekaf commented 2 years ago

@scramblingbalam, the `information_content()` function in wordnet.py does not yet handle adjective satellites. So this is an nltk/nltk issue: nothing needs to change in nltk_data.

After the satellite problem is fixed, your example could work with IC scores that you calculate yourself with the WordNet library. It still could not work with the wordnet_ic corpora, though, because those files define information content only for nouns and verbs, not adjectives:

    def ic(self, icfile):
        """
        Load an information content file from the wordnet_ic corpus
        and return a dictionary.  This dictionary has just two keys,
        NOUN and VERB, whose values are dictionaries that map from
        synsets to information content values.

        :type icfile: str
        :param icfile: The name of the wordnet_ic file (e.g. "ic-brown.dat")
        :return: An information content dictionary
        """