nltk / nltk

Cannot get WordNet synsets for English without lemmatization #2421

Open goodmami opened 4 years ago

goodmami commented 4 years ago

Querying wordnet with wordnet.synsets() will lemmatize the query word, but only for English. While this is useful for many applications, sometimes I do not want such lemmatization. For instance, I have dictionary forms for multiple languages (from, e.g., Swadesh lists) and I want to detect differences in polysemy between languages, but the lemmatization inflates the apparent polysemy for English. There appears to be no way (in the public API) to do an English query without lemmatization.

For example:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('eyeglasses')
[Synset('spectacles.n.01'), Synset('monocle.n.01')]
>>> wn.synsets('eyeglasses')[0].lemma_names()
['spectacles', 'specs', 'eyeglasses', 'glasses']
>>> wn.synsets('eyeglasses')[1].lemma_names()
['monocle', 'eyeglass']

The second synset (monocle.n.01) is returned because the query is lemmatized to 'eyeglass', which is one of its lemmas; 'eyeglasses' itself appears only in the first synset. Specifying the POS sometimes helps, as with 'scissors' and 'scissor.v.01', but not always: both synsets for 'eyeglasses' above are 'n'. I end up needing to write a wrapper like this:

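from nltk.corpus import wordnet as wn

def synsets(lemma, pos=None, lang='eng', check_exceptions=True):
    # Same arguments as wn.synsets(); only English results are post-filtered.
    results = wn.synsets(lemma,
                         pos=pos,
                         lang=lang,
                         check_exceptions=check_exceptions)
    if lang == 'eng':
        # Keep only synsets that list the query form itself as a lemma,
        # dropping matches that were found through lemmatization.
        results = [ss for ss in results if lemma in ss.lemma_names()]
    return results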

Am I missing something or is this currently the best way around the issue?

alvations commented 4 years ago

@goodmami please take a look at the proposal on https://github.com/nltk/wordnet/pull/18

Soon the default NLTK WordNet API will be replaced by that standalone library =)