nltk / wordnet

Stand-alone WordNet API

Add flag to permit synset lookup without stemming #17

Open stevenbird opened 5 years ago

stevenbird commented 5 years ago

Cf https://github.com/nltk/nltk/issues/2421

I propose that we add a stem=False flag to wn.synsets().

It means that default behaviour for English will change, but I see no other option, given that stemming only happens for English wordnet. This would make behaviour consistent across languages.

alvations commented 5 years ago

This is actually a little complicated. WordNet access depends heavily on the morphy algorithm to fetch the synsets, and setting stem=False would end up skipping all the exception forms that morphy handles for English, e.g.

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('geese')
[Synset('goose.n.01'), Synset('fathead.n.01'), Synset('goose.n.03')]
>>> wn.synsets('mice')
[Synset('mouse.n.01'), Synset('shiner.n.01'), Synset('mouse.n.03'), Synset('mouse.n.04')]

wn.synsets() is not exactly doing stemming but lemmatization, through morphy(). I would suggest exposing a use_morphy=True default argument instead. If a lemma is found directly from the user's input to wn.synsets(), then skip morphy. Otherwise, check whether the use_morphy argument is on and lemmatize with morphy when necessary.
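A minimal sketch of that lookup order, using toy data and hypothetical names (not the actual wn package internals):

```python
# Toy sketch of the proposed lookup order: try the user's input
# directly, and fall back to morphy-style lemmatization only when
# the flag is on. LEXICON, EXCEPTIONS, and toy_morphy are
# illustrative stand-ins, not the real wn internals.
LEXICON = {'goose': ['goose.n.01'], 'mouse': ['mouse.n.01']}
EXCEPTIONS = {'geese': 'goose', 'mice': 'mouse'}  # irregular forms

def toy_morphy(word):
    """Stand-in for morphy(): map irregular inflections to lemmas."""
    return EXCEPTIONS.get(word, word)

def synsets(word, use_morphy=True):
    if word in LEXICON:        # 1. direct hit: skip morphy entirely
        return LEXICON[word]
    if use_morphy:             # 2. otherwise lemmatize if allowed
        return LEXICON.get(toy_morphy(word), [])
    return []
```

With use_morphy=True, synsets('geese') reaches the goose entry via the exception list; with use_morphy=False it returns nothing, which is exactly the exceptions-skipping behaviour being discussed.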

alvations commented 5 years ago

Also, I think the name of the first argument to wn.synsets() is a misnomer: it shouldn't be lemma but word, since most people have been passing words rather than lemmas as the function's input =)

alvations commented 5 years ago

Okay, now this is awkward.

Actually, without any modification, the "eyeglasses" example has been "resolved". Unlike the cyclic nature of the old WordNet API, the new wordnet interface doesn't go through the lemma_names() function that relies on the morphy lemmatizer to fetch the synsets. So by default it follows only the lemma names of lemmas that are linked directly from the WordNet data.* files.

Existing behavior of wn without disabling morphy:

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('eyeglasses')
[Synset('spectacles.n.01')]

Still, I think exposing an argument for users to disable morphy when needed is helpful. Thus #18

goodmami commented 5 years ago

This is nice, but use_morphy as a parameter name is unfortunately English-specific since the Morphy tool is (implicitly) only for English. Not only is the parameter completely irrelevant if lang is something other than eng, but what if later someone adds a lemmatizer for another language? I agree that stem as a parameter name is inaccurate as it's lemmatizing and not stemming, so why not lemmatize?

Also, I'm with @stevenbird that it's best to make behavior consistent for all languages instead of special-casing English. Since you're replacing the default WordNet module in the NLTK, this seems like a good time to introduce such a change. However, if NLTK follows semantic versioning and you're not ready to make a 4.0 release (because of the backward-compatibility breakage), you could make the default True and issue a warning (WordNetWarning("lemmatization is not provided for this language") or DeprecationWarning("lemmatization will be turned off by default in the next major version")), then make the change for a later release.

Finally, it would be even better if users could supply their own lemmatizer. E.g., wn.synsets(word, lang='xyz', lemmatize=lemmatize_xyz) where lemmatize_xyz is a compatible function for lemmatizing words in language xyz. This way users could even use other lemmatizers for English, too. For convenience, if lemmatize=True then it uses the default function depending on the value of lang, and if none exists for the language, an error is raised.
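One way to sketch that True/False/callable dispatch, with hypothetical names and a toy default lemmatizer standing in for morphy:

```python
# Hypothetical sketch of the lemmatize=True/False/callable dispatch
# described above. The lambda is a toy stand-in for morphy; a real
# table would map language codes to real lemmatizer functions.
DEFAULT_LEMMATIZERS = {'eng': lambda w: w[:-1] if w.endswith('s') else w}

def resolve_lemmatizer(lemmatize, lang='eng'):
    if callable(lemmatize):
        return lemmatize                  # user-supplied function wins
    if lemmatize is True:
        if lang in DEFAULT_LEMMATIZERS:
            return DEFAULT_LEMMATIZERS[lang]
        # per the suggestion: error out when True but no default exists
        raise LookupError("no default lemmatizer for language '%s'" % lang)
    return None                           # lemmatize=False: do nothing
```

The convenience behaviour falls out of the last branch of the True case: lemmatize=True picks the default for lang, and raises only when no default exists.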

stevenbird commented 5 years ago

Thanks for these suggestions @goodmami.

So the default behaviour would be to use a lemmatizer if one is available, else proceed without (issuing the warning). The only change required will be for users of wordnets other than English, who will have to tweak their code to avoid the warning.

And if a function is passed, we use it.

alvations commented 4 years ago

@goodmami @stevenbird got some free time to look at this again.

Let me try to confirm the requirements before I reimplement stuff =)

Do the requirements sound about right?

goodmami commented 4 years ago

That's close to what I was thinking. But more specifically:

DEFAULT_LEMMATIZERS = {
    'eng': morphy,
    ...
}

def synsets(word, pos=None, lang='eng', check_exceptions=True, lemmatize=True):
    if lemmatize is True:
        if lang not in DEFAULT_LEMMATIZERS:
            warnings.warn(
                "No default lemmatizer for language '{}'".format(lang),
                WordNetWarning)
            lemmatize = False
        else:
            lemmatize = DEFAULT_LEMMATIZERS[lang]
    if lemmatize:
        word = lemmatize(word, pos=pos, check_exceptions=check_exceptions)
    ...

This way we keep the default behavior, but users can easily disable English lemmatization with lemmatize=False. For other values of lang, only a warning (not an error) will appear if there is no lemmatizer defined and they don't change the default value of lemmatize. And other lemmatizers can be used by passing a compatible function in directly. That function would have the signature lemmatize(word, pos=None, check_exceptions=True) for compatibility, but the latter two may not be relevant for other lemmatizers. Actually I'd rather get rid of check_exceptions and instead let users pass in things like lemmatize=morphy_no_exceptions or something, but I kept it in for backward compatibility.

Finally, I now wonder if "lemmatize" is even the right word, because I can imagine users only wanting simple normalization, like downcasing. Maybe normalize?
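If the parameter became a general normalize, user-composed pipelines would fall out naturally. A hedged sketch, with all names hypothetical:

```python
# Sketch of the `normalize` idea: any word -> word callable would do,
# so simple normalization (downcasing) can be chained with a
# lemmatizer. All names here are hypothetical, not the wn API.
def downcase(word):
    return word.lower()

def toy_lemmatize(word):
    # Toy exception table standing in for morphy's irregular forms.
    return {'geese': 'goose', 'mice': 'mouse'}.get(word, word)

def compose(*funcs):
    """Chain normalizers left to right into a single callable."""
    def normalize(word):
        for f in funcs:
            word = f(word)
        return word
    return normalize
```

Under a hypothetical normalize= parameter, passing compose(downcase, toy_lemmatize) would map 'Geese' to 'goose' before lookup, while downcase alone would only handle case.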