stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

lemmatizer is deterministic for form/pos even for words which can have multiple correct lemmas #1286

Open AngledLuffa opened 11 months ago

AngledLuffa commented 11 months ago

example:

most common: 's_VERB can be either is or has

less likely, but still possible: wound_VERB can be either wound or wind; bound and found behave similarly
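A minimal sketch of why this happens, using toy data (not Stanza's actual lexicon): when the cached lemmatizer is a mapping keyed on (word form, POS), each key can hold only one lemma, so one of the valid readings always wins.

```python
# Hypothetical (form, POS) -> lemma table with unique keys, so only one
# lemma per pair can be stored; the other valid reading is lost.
lemma_table = {
    ("'s", "VERB"): "be",       # but "'s" can also stand for "has" -> "have"
    ("wound", "VERB"): "wind",  # but "wound" can also be the verb "wound"
}

def lemmatize(form, pos):
    # Fall back to the surface form for unknown pairs.
    return lemma_table.get((form, pos), form)

print(lemmatize("'s", "VERB"))  # always "be", never "have"
```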

zeeyado commented 3 months ago

Hi, is there a known workaround for this, e.g. returning all the lemmas a word form could map to?

AngledLuffa commented 3 months ago

Yes, actually, we put together a small classifier model for some of the most common cases in the treebanks. We just haven't released it yet. Probably mid-to-late June.

zeeyado commented 3 months ago

Is this classifier model just a list/index of the most common cases in English?

I was wondering if it's possible to do multilingual lemmatization of single words, returning all possible lemmas, e.g. for "saw" you would get the noun "saw" and the verb "see".

Is something like that feasible at all or would it be better to use a dictionary/lookup approach?
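The dictionary/lookup approach mentioned above could be sketched as follows, with toy data standing in for a real lexicon: map each surface form to every (POS, lemma) pair observed for it, and return them all.

```python
from collections import defaultdict

# Toy lexicon for illustration only, not Stanza's actual data.
lexicon = defaultdict(set)
for form, pos, lemma in [
    ("saw", "NOUN", "saw"),  # "a rusty saw"
    ("saw", "VERB", "see"),  # "I saw a pile of lumber"
    ("saw", "VERB", "saw"),  # "I need to saw this lumber"
]:
    lexicon[form].add((pos, lemma))

def all_lemmas(form):
    # Unknown forms fall back to the surface form with no POS.
    return sorted(lexicon.get(form, {(None, form)}))

print(all_lemmas("saw"))  # [('NOUN', 'saw'), ('VERB', 'saw'), ('VERB', 'see')]
```

The limitation, as noted in the reply below this comment, is that a lookup like this only covers forms that appear in the dictionary; it cannot enumerate readings for unseen words.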

AngledLuffa commented 3 months ago

There's not really a way to get back all possible expansions: the seq2seq model doesn't know which POS tags are possible for unseen words.

The dictionary does already take into account POS, so your particular "saw" example is already covered. The distinction we will soon fix is "I need to saw this lumber" vs "I saw a pile of lumber".
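To make the distinction concrete (hypothetical data, not Stanza's internals): keying on POS separates the noun "saw" from the verb reading, but both verb readings share the key ("saw", "VERB"), so a deterministic table must pick one of them regardless of context.

```python
# POS-keyed lemma dictionary: disambiguates noun vs verb "saw",
# but the two verb readings collide on the same key.
pos_lemmas = {
    ("saw", "NOUN"): "saw",  # "a pile of lumber and a saw"
    ("saw", "VERB"): "see",  # also valid: "saw", as in "to saw lumber"
}
print(pos_lemmas[("saw", "VERB")])  # "see" even in "I need to saw this lumber"
```

Choosing between the two verb readings requires looking at context, which is what the classifier mentioned earlier in the thread is for.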