pseudomonas closed this issue 2 months ago
The entire treebank is lowercase letters. I could imagine adding a feature where if the treebank is >99% lowercase, the model always lowercases everything.
Interestingly, the POS model already lowercases before using the word vectors, which is why it doesn't fail horribly when the model is fed capitals.
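To make the ">99% lowercase" idea concrete, here is a rough sketch of what such a check could look like. The function names and the way the threshold is applied are my own assumptions for illustration; this is not how Stanza actually implements it.

```python
def is_mostly_lowercase(words, threshold=0.99):
    """Return True if more than `threshold` of the alphabetic words are lowercase.

    Hypothetical helper: tokens with no letters (punctuation, numbers) are
    ignored so they do not skew the ratio.
    """
    cased = [w for w in words if any(ch.isalpha() for ch in w)]
    if not cased:
        return False
    lowercase_count = sum(1 for w in cased if w == w.lower())
    return lowercase_count / len(cased) > threshold


def normalize_word(word, caseless):
    """Lowercase the word only when the training data was judged caseless."""
    return word.lower() if caseless else word
```

A model built this way would decide `caseless` once at training time and then apply `normalize_word` to every input word at prediction time.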
I'd expect it to behave like an uncased model and it's not a huge faff for me to just convert everything to lower-case before processing it. It just seemed like an unfortunate quirk.
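For anyone else hitting this, the lowercase-first workaround might look roughly like the sketch below. `lemmatize_lowercased` is a hypothetical helper name; it only assumes that `nlp` is a callable returning a document with `sentences`, each holding `words` that carry `text` and `lemma`, which is the shape a Stanza pipeline returns.

```python
def lemmatize_lowercased(nlp, text):
    """Lowercase the input before running the pipeline, then collect lemmas.

    Returns (word, lemma) pairs. The words are the lowercased forms, since
    the original casing is discarded before the pipeline ever sees it.
    """
    doc = nlp(text.lower())
    return [
        (word.text, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

With Stanza, `nlp` would be something like `stanza.Pipeline("la", package="ittb")`, assuming the Latin ITTB models have been downloaded.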
Just to verify, what you want is the lemmas "qui sum demonstro"?
If you try the lowercase_lemmas branch, the la_ittb lemmatizer will now automatically treat all text as if it were lowercased. I haven't done anything with the tokenizer or POS yet, though. Have you noticed the tokenizer behaving badly with capitalized letters?
Any thoughts on this fix?
The lemmatizer now trains a caseless version of itself if all of the training data is caseless, as proposed in the above PR. The 1.8.1 version of the Latin lemmatizer uses that feature, so the lemmatizer gives the same output for any capitalization variation of "quod erat demonstrandum".
POS and depparse already use caseless versions of the word embeddings, so casing has a much smaller impact on them.
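If you want to check the "same output for any capitalization variation" claim on your own text, a small sketch like the following could help. Both function names are hypothetical, and `lemmatize` stands for any callable that maps a string to a sequence of lemmas (for example, a thin wrapper around a Stanza pipeline).

```python
def case_variants(text):
    """Return distinct casing variants of `text`, preserving order."""
    candidates = [text, text.lower(), text.upper(), text.title(), text.capitalize()]
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


def is_case_invariant(lemmatize, text):
    """Return True if `lemmatize` gives identical lemmas for every variant."""
    outputs = {tuple(lemmatize(variant)) for variant in case_variants(text)}
    return len(outputs) == 1
```

Running `is_case_invariant` over a handful of sentences is a cheap way to spot any remaining casing sensitivity.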
Please let us know if this resolves the issue.
The Latin default package (ITTB) doesn't usually lemmatize words starting with a capital letter. This seems to be the case whether the word is a proper noun that is normally capitalised (e.g. "Iacobi"), a common word that is extraordinarily capitalised, or a word capitalised out of devotion (e.g. "Deo"). This seems to be a systematic problem, though in the example below "Erat" is lemmatized to "sum"; I have not done any digging into what might provoke this behaviour.
To Reproduce: see code below
Environment (please complete the following information):