Open eduamf opened 3 years ago
These are the words that do have a lemma of convidar in this training set:
convidada convidadas convidado convidados convidámos convidando convidou
Most verb forms in the training data that end in ...vido belong to -er or -ir verbs. That's probably why it is making that mistake with convido. There are quite a few verb forms in the training data that end in ...ei, but all of them belong to either -ar or -er verbs, so frankly IDK how it came up with an -ir ending for convidei.
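To make that reasoning concrete, here is a minimal sketch of how a majority vote over suffixes could turn convido into conver. It is a toy illustration only, not how the actual lemmatizer works internally. The word pairs and the helpers `learn_suffix_rule` and `apply_rule` are hypothetical and are not taken from the real training set.

```python
from collections import Counter

# Hypothetical (form, lemma) pairs standing in for training data.
# Every form ends in "...vido" and belongs to an -er or -ir verb.
TRAIN = [
    ("vivido", "viver"),
    ("movido", "mover"),
    ("resolvido", "resolver"),
    ("ouvido", "ouvir"),
    ("servido", "servir"),
]


def learn_suffix_rule(pairs, suffix):
    """Return the lemma ending that most often replaces `suffix`."""
    endings = Counter()
    for form, lemma in pairs:
        if form.endswith(suffix):
            stem = form[: -len(suffix)]
            if lemma.startswith(stem):
                endings[lemma[len(stem):]] += 1
    return endings.most_common(1)[0][0]


def apply_rule(form, suffix, ending):
    """Swap `suffix` at the end of `form` for `ending`."""
    return form[: -len(suffix)] + ending


ending = learn_suffix_rule(TRAIN, "ido")
guess = apply_rule("convido", "ido", ending)
print(ending, guess)
```

Because -er wins the vote among the toy examples, the rule strips "ido" from convido and appends "er", which produces a non-word.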
The GSD dataset actually includes convido, so a model trained on it should theoretically process that word correctly, but neither dataset has convidei in it. Unfortunately, we can't simply mix the two, as there are significant differences in the lemma schemes.
Theoretically, we could always put together some extra data involving some less common verb endings, with the expectation that adding some specific examples will teach the models how to process it correctly.
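As a rough sketch of what such extra data might start from, the snippet below generates (form, lemma) pairs for a regular -ar verb. The helper `conjugate_regular_ar` and the selection of endings are assumptions for illustration. Real training data would most likely need full annotated sentences in the treebank's own format rather than bare word pairs.

```python
# A few regular -ar endings: present and preterite indicative
# (eu, ele, nós, eles). This is deliberately not a full paradigm.
AR_ENDINGS = ["o", "a", "amos", "am", "ei", "ou", "aram"]


def conjugate_regular_ar(lemma):
    """Return (form, lemma) pairs for a regular verb ending in -ar."""
    if not lemma.endswith("ar"):
        raise ValueError(f"not an -ar verb: {lemma!r}")
    stem = lemma[:-2]
    return [(stem + ending, lemma) for ending in AR_ENDINGS]


pairs = conjugate_regular_ar("convidar")
for form, lemma in pairs:
    print(form, lemma)
```

For convidar this yields both convido and convidei, the two forms reported in this issue.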
Thank you. But if those weird words (convidir and conver) were not expected, does this mean that those words are not in the dictionary?
With spaCy, when it doesn't find a word, it seems to me that it returns the word itself. Another option would be a method to check whether the word exists in the model.
It's trying to generalize when processing words it hasn't seen before. There would be quite a few misses using a strategy which only looks at the training data.
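For comparison, a lookup-only strategy with the fallback and existence check asked about above could look roughly like this. The table `KNOWN_LEMMAS` and the functions `is_known` and `lookup_lemma` are hypothetical names for this sketch. They are not part of spaCy or of this library.

```python
# Hypothetical lookup table built from forms seen in training data.
KNOWN_LEMMAS = {
    "convidou": "convidar",
    "convidando": "convidar",
    "convidados": "convidar",
}


def is_known(word, table):
    """Report whether the table has an entry for `word`."""
    return word.lower() in table


def lookup_lemma(word, table):
    """Return the stored lemma, or the word itself when it is unseen."""
    return table.get(word.lower(), word)


print(lookup_lemma("convidou", KNOWN_LEMMAS))
print(lookup_lemma("convido", KNOWN_LEMMAS))
print(is_known("convido", KNOWN_LEMMAS))
```

This never invents a form such as conver, but every unseen word, including convido here, comes back unlemmatized. That is the kind of miss a generalizing model is meant to avoid.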
To Reproduce Steps to reproduce the behavior:
The verb "convido" returns the lemma "conver", a word that does not exist in Portuguese (Brazilian or European). If I change the verb from "convido" to "convidei", changing only the tense to past, the returned lemma is "convidir", another weird word!
Expected behavior The expected behavior is to receive "convidar", the verb in infinitive. The used words are not homographs.
Environment (please complete the following information):
Additional context My solution was to customize the package "bosque.pt" doing this:
With that change it works OK, but just for me. The verb "convidar" means "to invite" in English, and it is a frequently used verb.