several common Spanish verbs forms don't use the infinitive as the lemma

psnider commented 1 year ago

Describe the bug

It seems that a fair number of Spanish verbs return the wrong lemma. I've summarized a Spanish text of ~3,500 words, and found ~500 different lemmas, with ~140 verbs. However, at least 7 of these verbs have incorrect lemmas. (I'm a beginner in Spanish, but I believe this is very easy to determine, because all verb infinitives end in the letter "r".)

Here are the seven verbs, with the incorrect lemmas:

reproche: from "que me reproches", the infinitive is reprochar
sueño: from "te sueño despierta", the infinitive is soñar
calma: from "bálsamo que calma", the infinitive is calmar
canto: from "esta melodía que canto hoy", the infinitive is cantar
gira: from "Él gira a la derecha", the infinitive is girar
ando: from "Que ando cargando", the infinitive is andar
saludo: from "saludo mi presente", the infinitive is saludar

To Reproduce

import stanza

# Spanish pipeline; the lemma processor depends on tokenize, mwt and pos.
processors = "tokenize,mwt,ner,sentiment,pos,lemma"
nlp_es_parse = stanza.Pipeline(lang="es", processors=processors)
text = """
    que me reproches,
    te sueño despierta,
    sol que calma,
    esta melodía que canto hoy,
    Él gira a la derecha,
    que ando cargando,
    saludo mi presente
""" 

nlp_results = nlp_es_parse(text)

nlp_results contains the following corresponding parts (with some fields removed):

    {
      "id": 3,
      "text": "reproches",
      "lemma": "reproche",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 6,
      "text": "sueño",
      "lemma": "sueño",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 11,
      "text": "calma",
      "lemma": "calma",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 16,
      "text": "canto",
      "lemma": "canto",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 20,
      "text": "gira",
      "lemma": "gira",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 26,
      "text": "ando",
      "lemma": "ando",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    },
    {
      "id": 29,
      "text": "saludo",
      "lemma": "saludo",
      "upos": "VERB",
      "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
    },
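For anyone who wants to produce a listing like the one above, here is a minimal sketch. It relies only on the `sentences`, `words`, `text`, `lemma` and `upos` attributes of a parsed Stanza document. The helper name `verb_entries` is my own, and the stand-in objects at the bottom are hypothetical data so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def verb_entries(doc):
    """Collect text, lemma and upos for every word tagged VERB.

    `doc` is expected to look like a Stanza Document: it has `.sentences`,
    each sentence has `.words`, and each word has `.text`, `.lemma`, `.upos`.
    """
    entries = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if word.upos == "VERB":
                entries.append(
                    {"text": word.text, "lemma": word.lemma, "upos": word.upos}
                )
    return entries


# Stand-in for a parsed document (hypothetical data mirroring the report).
fake_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(
            words=[
                SimpleNamespace(text="te", lemma="tú", upos="PRON"),
                SimpleNamespace(text="sueño", lemma="sueño", upos="VERB"),
                SimpleNamespace(text="despierta", lemma="despierto", upos="ADJ"),
            ]
        )
    ]
)

print(verb_entries(fake_doc))
```

With a real pipeline, `nlp_results` from the reproduction steps would be passed in place of `fake_doc`.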

Expected behavior

The infinitive form of the verb should be given as the lemma.

Environment (please complete the following information):

MacOS Monterey 12.1 (21C52)
Python 3.9.7 from Anaconda3
stanza-1.5.0
java version "1.8.0_371"

psnider commented 1 year ago

As a workaround, I added mappings from the incorrect lemmas to their infinitives:

spanish_verbs_with_incorrect_lemmas = {
    "ando":  "andar",
    "calma": "calmar",
    "canto": "cantar",
    "corresponderte": "corresponder",
    "gira":  "girar",
    "reproche": "reprochar",
    "saludo": "saludar",
    "sueño": "soñar",
}

Then my application code coerces the lemmas (for verbs only) if they are in this map.

I also issue an error message whenever a verb's lemma doesn't end in "r", which makes such incorrect lemmas easier to find.
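Roughly, the coercion and the check could look like the sketch below. The function name `fix_verb_lemma`, the logger name and the shortened map are placeholders for illustration, not my actual application code.

```python
import logging

logger = logging.getLogger("lemma_check")

# Hypothetical subset of the override map, kept short for the example.
LEMMA_OVERRIDES = {
    "ando": "andar",
    "canto": "cantar",
    "sueño": "soñar",
}


def fix_verb_lemma(lemma, upos, overrides=LEMMA_OVERRIDES):
    """Return a corrected lemma for verbs; leave every other word untouched."""
    if upos != "VERB":
        return lemma
    corrected = overrides.get(lemma, lemma)
    # Assumption from this thread: a Spanish verb lemma should be an
    # infinitive, and infinitives end in "r". Anything else is reported
    # so it can be reviewed and added to the map.
    if not corrected.endswith("r"):
        logger.error("suspicious verb lemma: %r", corrected)
    return corrected


print(fix_verb_lemma("canto", "VERB"))
print(fix_verb_lemma("canto", "NOUN"))
print(fix_verb_lemma("gira", "VERB"))
```

Only words tagged VERB are touched, so a noun such as "canto" (song) keeps its lemma. A verb lemma that is missing from the map is returned unchanged and logged.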

AngledLuffa commented 1 year ago

We are unlikely to fix this in the next few weeks. At the end of the month I can dust off some old work on expanding the lemmas that our Spanish tools can produce, and hopefully that will help.

psnider commented 1 year ago

No worries, and I really appreciate your prompt reply!

I looked at the notes about the UD Spanish AnCora corpus on which the Spanish model was trained. See: https://universaldependencies.org/treebanks/es_ancora/index.html It seems that this was made from news (online articles?), and it appears "This corpus contains 17662 sentences".

I am new to NLP, but 17k sentences seems like very little. And news is not much like conversation. After poking around it seems that the right place for me to engage is: https://github.com/UniversalDependencies/UD_Spanish-AnCora

AngledLuffa commented 1 year ago

17K isn't that bad, but it will definitely be missing many verbs. Ideally the tokenizer would pick up the correct pattern, but apparently not in this case...

The solution we often go with is to mix multiple datasets together, but I remember finding that was not really feasible in Spanish because of labeling differences. Perhaps there is some room to mix just the tokenizer datasets, or add sentences specifically with the infinitives. Either way, we'll have to revisit that in July.

psnider commented 1 year ago

I noticed that it is fairly common for Latin American Spanish to include Portuguese! So I'm trying to determine the language as well, so that I don't accidentally consider a lemma for a Portuguese verb as being wrong.

I will be careful to check this as well for any more verbs I find with the incorrect lemma.

psnider commented 1 year ago

Here is an updated list of verbs I found, after processing documents containing 90k words. I also tried to identify Portuguese (using Stanza), and filtered those out of the results. So this list is just Spanish verbs.
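The filtering step can be sketched as below. Stanza ships a language identification processor, but how it is wired up is left out here, so the detector is passed in as a plain function. `keep_language` and `toy_detector` are hypothetical names, and the keyword check is only a stand-in so the example runs on its own; it is not real language identification.

```python
def keep_language(texts, detect_language, wanted="es"):
    """Return only the texts that `detect_language` labels as `wanted`."""
    return [text for text in texts if detect_language(text) == wanted]


def toy_detector(text):
    """Hypothetical stand-in: a crude keyword check, NOT real language ID."""
    portuguese_markers = {"não", "você", "obrigado", "muito"}
    words = set(text.lower().split())
    return "pt" if words & portuguese_markers else "es"


samples = ["te sueño despierta", "você não sabe", "saludo mi presente"]
print(keep_language(samples, toy_detector))
```

Keeping the detector as a parameter means the same filter works whether the label comes from Stanza or from some other tool.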

I also noticed that some of the lemmas are capitalized, which seems incorrect. For example, see "Miren" in the map below.

Here is a map from incorrect lemmas to correct ones, in case this helps. Just a reminder, I'm a beginning Spanish student, so I may have made some mistakes.

spanish_verbs_with_incorrect_lemmas = {
    "acariciado": "acariciar",
    "acuerdo": "acordar",
    "aislado": "aislar",
    "ando":  "andar",
    "apuesto": "apostar",
    "asesinado": "asesinar",
    "armado": "armar",
    "bota": "botar",
    "calma": "calmar",
    "canto": "cantar",
    "cito": "citar",
    "colapsado": "colapsar",
    "coma": "comer",
    "come": "comer",
    "construido": "construir",
    "consultado": "consultar",
    "convencido": "convencer",
    "corresponderte": "corresponder",
    "definido": "definir",
    "derrotado": "derrotar",
    "descuidado": "descuidar",
    "desmantelaba": "desmantelar",
    "dividido": "dividir",
    "duerma": "dormir",
    "editado": "editar",
    "encantado": "encantar",
    "encontrado": "encontrar",
    "escrito": "escribir",
    "espera": "esperar",
    "esperando": "esperar",
    "esponjaba": "esponjar",
    "está": "estar",
    "excluido": "excluir",
    "expuesto": "exponer",
    "firma": "firmar",
    "forzado": "forzar",
    "fuera": "ser",
    "gira":  "girar",
    "gustaba": "gustar",
    "identifica": "identificar",
    "importa": "importar",
    "lanzado": "lanzar",
    "manipulado": "manipular",
    "matriculado": "matricular",
    "miren": "mirar",
    "Miren": "mirar",
    "Miró": "mirar",
    "miró": "mirar",
    "nací": "nacer",
    "otorgado": "otorgar",
    "pasaba": "pasar",
    "perjudicado": "perjudicar",
    "picado": "picar",
    "prestado": "prestar",
    "quiso": "querer",
    "reconocido": "reconocer",
    "recorrido": "recorrer",
    "regreso": "regresar",
    "reproche": "reprochar",
    "ríe": "reír",
    "saco": "sacar",
    "saludo": "saludar",
    "son": "ser",
    "suena": "sonar",
    "sueño": "soñar",
    "televisado": "televisar",
    "tira": "tirar",
    "traslado": "trasladar",
    "va": "ir",
    "valía": "valer",
    "vaya": "ir",
    "vetado": "vetar",
}
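Since a hand-built map like this can contain slips, a small self-check may help. The sketch below assumes every target should end in "ar", "er", "ir" or "ír" (reflexive forms in "-se" are not handled), and it folds keys to lowercase so that capitalized lemmas such as "Miren" do not need their own entries. The function names and the sample map, including its deliberately wrong "picado" entry, are made up for illustration.

```python
def non_infinitive_targets(overrides):
    """Return the (key, value) pairs whose value does not look like an infinitive."""
    endings = ("ar", "er", "ir", "ír")
    return [(k, v) for k, v in overrides.items() if not v.endswith(endings)]


def fold_case(overrides):
    """Build a lookup keyed by lowercased lemma, so "Miren" and "miren" share an entry."""
    return {key.lower(): value for key, value in overrides.items()}


# Hypothetical sample; the last entry is intentionally wrong.
sample = {"Miren": "mirar", "miren": "mirar", "ríe": "reír", "picado": "picado"}

print(non_infinitive_targets(sample))
print(fold_case(sample))
```

With a folded map, the lookup would use `lemma.lower()` as the key.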

stanfordnlp / stanza

several common Spanish verbs forms don't use the infinitive as the lemma #1255