stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Latin default package doesn't usually lemmatize words starting with a capital letter #1330

Closed: pseudomonas closed this issue 2 months ago

pseudomonas commented 10 months ago

The Latin default package (ITTB) doesn't usually lemmatize words starting with a capital letter. This seems to be the case whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that is unusually capitalised, or a word capitalised out of devotion (e.g. "Deo"). The problem seems to be systematic, although in the example below "Erat" is correctly lemmatized to "sum". I have not done any digging into what might provoke this behaviour.

To reproduce, see the code below.

import stanza

# Default Latin package (ITTB) with the tokenizer, POS tagger, and lemmatizer
latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')

sent = "Quod Erat Demonstrandum"

print(latindefault(sent))

# Correctly identifies parts of speech; does not lemmatize.
 # {
 #      "id": 3,
 #      "text": "Demonstrandum",
 #      "lemma": "Demonstrandum",
 #      "upos": "VERB",
 #      "xpos": "J2|modO|grp1|casA|gen3",
 #      "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
 #      "start_char": 10,
 #      "end_char": 23
 #    }

print(latindefault(sent.lower()))

# Correctly identifies parts of speech and lemmatizes.
# {
#       "id": 3,
#       "text": "demonstrandum",
#       "lemma": "demonstro",
#       "upos": "VERB",
#       "xpos": "J2|modO|grp1|casA|gen3",
#       "feats": "Aspect=Prosp|Case=Nom|Gender=Neut|InflClass=LatA|InflClass[nominal]=IndEurO|Number=Sing|VerbForm=Part|Voice=Pass",
#       "start_char": 10,
#       "end_char": 23
#     }

AngledLuffa commented 10 months ago

The entire treebank is lowercase letters. I could imagine adding a feature where if the treebank is >99% lowercase, the model always lowercases everything.

Interestingly, the POS model already lowercases before using the word vectors, which is why it does not fail horribly when fed capitalized input.
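
As a rough illustration of the ">99% lowercase" idea, here is a minimal sketch in plain Python. It is not Stanza's implementation: the function names `is_mostly_lowercase` and `normalize_case`, the threshold parameter, and the toy word list are all hypothetical, chosen only to show the decision rule.

```python
def is_mostly_lowercase(words, threshold=0.99):
    """Return True if at least `threshold` of the cased words are fully lowercase.

    Words with no cased characters (punctuation, digits) are ignored.
    """
    cased = [w for w in words if w.lower() != w.upper()]
    if not cased:
        return False
    lowercase = sum(1 for w in cased if w == w.lower())
    return lowercase / len(cased) >= threshold


def normalize_case(word, caseless):
    """Lowercase the word only when the model was flagged as caseless."""
    return word.lower() if caseless else word


# Toy stand-in for a treebank's word forms (hypothetical, not real ITTB data)
train_words = ["quod", "erat", "demonstrandum", ".", "deo", "gratias"]
caseless = is_mostly_lowercase(train_words)
print(caseless)
print(normalize_case("Demonstrandum", caseless))
```

The flag would be computed once from the training data and then applied to every input at inference time, so a model trained on lowercase text never sees casing it has no examples of.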

pseudomonas commented 10 months ago

I'd expect it to behave like an uncased model and it's not a huge faff for me to just convert everything to lower-case before processing it. It just seemed like an unfortunate quirk.
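
For anyone else hitting this, the workaround amounts to something like the sketch below. The helper name `lemmatize_lowercased` is made up for illustration; it assumes the pipeline returns a Stanza-style document whose sentences have `tokens` with `start_char`, `end_char`, `text`, and `words`, and it only maps offsets back to the original text when lowercasing did not change the string length.

```python
def lemmatize_lowercased(pipeline, text):
    """Run `pipeline` on lowercased text; return (original surface form, lemma) pairs."""
    lowered = text.lower()
    doc = pipeline(lowered)
    # Character offsets only line up with the original text if lowercasing
    # kept the length unchanged (true for ordinary Latin-alphabet input).
    same_length = len(lowered) == len(text)
    pairs = []
    for sentence in doc.sentences:
        for token in sentence.tokens:
            if same_length:
                surface = text[token.start_char:token.end_char]
            else:
                surface = token.text
            for word in token.words:
                pairs.append((surface, word.lemma))
    return pairs


# Usage with the pipeline from the report (requires stanza and the Latin models):
#   import stanza
#   latindefault = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   print(lemmatize_lowercased(latindefault, "Quod Erat Demonstrandum"))
```

This keeps the original capitalisation available for display while the lemmatizer only ever sees lowercase input.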

AngledLuffa commented 10 months ago

Just to verify, what you want is the lemmas:

qui sum demonstro

AngledLuffa commented 10 months ago

If you try the lowercase_lemmas branch, the la_ittb lemmatizer will now automatically treat all text as if it were lowercased. I haven't done anything with the tokenizer or POS yet, though. Have you noticed the tokenizer behaving badly with capitalized letters?

AngledLuffa commented 10 months ago

Any thoughts on this fix?

AngledLuffa commented 9 months ago

The lemmatizer now trains a caseless version of itself if all of the training data is caseless, as proposed in the above PR. The 1.8.1 version of the Latin lemmatizer uses that feature, so the lemmatizer gives the same output for any capitalization variation of "quod erat demonstrandum".

POS and depparse already use caseless versions of the word embeddings, so casing has much less impact on those processors.
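
One way to check this on your own text is sketched below. The helper names `lemmas` and `lemmas_are_case_invariant` are hypothetical, not part of Stanza; the sketch only assumes that the pipeline returns a document whose sentences expose `words` with a `lemma` attribute, as Stanza documents do.

```python
def lemmas(pipeline, text):
    """Return the flat list of lemmas the pipeline produces for `text`."""
    doc = pipeline(text)
    return [word.lemma for sentence in doc.sentences for word in sentence.words]


def lemmas_are_case_invariant(pipeline, text):
    """Return True if several capitalization variants of `text` yield identical lemmas."""
    variants = [text, text.lower(), text.upper(), text.title()]
    results = [lemmas(pipeline, variant) for variant in variants]
    return all(result == results[0] for result in results)


# Usage (requires stanza and the Latin models):
#   import stanza
#   nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')
#   print(lemmas_are_case_invariant(nlp, "quod erat demonstrandum"))
```

Note that this also requires the tokenizer to split every variant the same way; if it does not, the lemma lists differ in length and the check returns False.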

Please let us know if this resolves the issue.