stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

PT verb lemmatization returns words that do not exist in Portuguese #664

Open eduamf opened 3 years ago

eduamf commented 3 years ago

To Reproduce
Steps to reproduce the behavior:

import stanza

stanza.download('pt')
stnz = stanza.Pipeline('pt', use_gpu=False)
# Run the pipeline; stnz(...) returns a stanza Document
doc = stnz("convido-os a levantarem-se para um minuto de silêncio .")
print(*[f'word: {word.text}\tlemma: {word.lemma}\tupos: {word.upos}'
        for sent in doc.sentences for word in sent.words], sep='\n')
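
For the first word, the output line looks like this (illustrative; it shows the lemma described below):

word: convido	lemma: conver	upos: VERB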

The verb "convido" return lemma "conver", a word that do not exist in Portuguese (Brazilian or European). If I change the verb from "convido" to "convidei", changing only past tense, the returned lemma is "convidir", another weird word!

Expected behavior
The expected behavior is to receive "convidar", the verb in the infinitive. The words used are not homographs.


Additional context
My workaround was to customize the "bosque.pt" model file like this:

# Load word_dict and composite_dict
import torch
model = torch.load('/home/eduardo/stanza_resources/pt/lemma/bosque.pt', map_location='cpu')
word_dict, composite_dict = model['dicts']
words = ["convidai","convidais","convidam","convidava","convidamo","convidamos","convidando","convidara",
         "convidaras","convidáramos","convidardes","convidareis","convidarão","convidaria","convidarias",
         "convidarem","convida","convidaremos","convidares","convidarmos","convidarás","convidará",
         "convidaríamos","convidaríeis","convidariam","convide","convidas","convida","convidasses",
         "convidássemos","convidaste","convidou","convidastes","convidaram","convidavas","convide","convideis",
         "convidem","convidasse","convideis","convidar","convidemos","convides","convido","convidáreis",
         "convidaram","convidarei","convidásseis","convidassem","convidávamos","convidáveis","convidavam",
         "convidei"]
for word in words:
    # Customize your own dictionary
    composite_dict[(word, 'VERB')] = 'convidar'
    word_dict[word] = 'convidar'
# Save your model
torch.save(model, '/home/eduardo/stanza_resources/pt/lemma/bosque_eduamf.pt')

Then it works, but only for my local installation. The verb "convidar" means "invite" in English; it is a frequently used verb.

# Load your customized model with Stanza
import stanza
stnz = stanza.Pipeline('pt', package='bosque', processors='tokenize,pos,lemma',
                      lemma_model_path='/home/eduardo/stanza_resources/pt/lemma/bosque_eduamf.pt',
                      use_gpu=False)
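
Rerunning the original sentence through this pipeline should now produce the expected lemma (a quick sanity check, assuming the customized model loaded above):

# Quick check: "convido" should now be lemmatized via the customized dictionary
doc = stnz("convido-os a levantarem-se para um minuto de silêncio .")
for sent in doc.sentences:
    for word in sent.words:
        print(f'word: {word.text}\tlemma: {word.lemma}')
# The first word "convido" should now show lemma "convidar"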
AngledLuffa commented 3 years ago

These are the words that do have a lemma of convidar in this training set:

convidada convidadas convidado convidados convidámos convidando convidou

Most verb forms in the training data ending in ...vido come from verbs with -er or -ir infinitives (for example, "vivido" → "viver" and "pedido" → "pedir"), which is probably why it makes that mistake: presumably, stripping "-ido" from "convido" and appending "-er" gives "conver". There are quite a few verbs in the training data ending in ...ei, but all of them are -ar or -er verbs, so frankly I don't know how it came up with an -ir ending for this one.

The GSD dataset actually includes "convido", so a model trained on it should theoretically process that word correctly, but neither dataset contains "convidei". Unfortunately, we can't simply mix the two, as there are significant differences in their lemmatization schemes.

Theoretically, we could always put together some extra training data covering less common verb endings, with the expectation that adding specific examples would teach the models to process them correctly.

eduamf commented 3 years ago

Thank you. But if those weird words ("convidir" and "conver") were not expected, does this mean those words are not in the dictionary? With spaCy, when it doesn't find a word, it seems the word itself is returned as the lemma. Another option would be a method to check whether a word exists in the model.
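
For example, building on the torch.load snippet above, an existence check against the saved dictionaries could look like this (a sketch against the model file's internal dicts, not an official stanza API):

# Sketch: check whether a word was seen during training (not an official API)
import torch
model = torch.load('/home/eduardo/stanza_resources/pt/lemma/bosque.pt', map_location='cpu')
word_dict, composite_dict = model['dicts']

def in_lemma_dict(word, upos=None):
    # The composite dict is keyed on (word, POS); the word dict on the word alone
    if upos is not None and (word, upos) in composite_dict:
        return True
    return word in word_dict

print(in_lemma_dict('convidado', 'VERB'))  # True: listed among the training forms above
print(in_lemma_dict('convidei', 'VERB'))   # False: falls through to the neural model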

AngledLuffa commented 3 years ago

It's trying to generalize when processing words it hasn't seen before. A strategy that only looked at the training data would miss quite a few words.
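
In rough terms, the lemmatizer combines a dictionary with a neural model: dictionary lookup for forms seen in training, and character-level seq2seq generation for unseen forms. A simplified sketch of that strategy (not stanza's actual code; seq2seq_model.generate is a stand-in):

# Simplified sketch of the dictionary-plus-seq2seq strategy (not stanza's actual code)
def lemmatize(word, upos, composite_dict, word_dict, seq2seq_model):
    # 1. Exact (word, POS) pair seen in training
    if (word, upos) in composite_dict:
        return composite_dict[(word, upos)]
    # 2. Word seen in training, regardless of POS
    if word in word_dict:
        return word_dict[word]
    # 3. Unseen word: generate a lemma character by character,
    #    which is where forms like "conver" and "convidir" come from
    return seq2seq_model.generate(word, upos)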