hipster-philology / nlp-pie-taggers

Extension for pie to include taggers with their models and pre/postprocessors
Mozilla Public License 2.0

Latin Tokenizer is screwing up a lot of -n ending words #8

Closed PonteIneptique closed 4 years ago

PonteIneptique commented 4 years ago

Basically, currently, words ending in -n are assumed to be an apocope of the enclitic -ne, unless they are in a list of known words that legitimately end in -n.

The issue is that the number of different spellings (I saw today Fison instead of Phison not being caught) makes me feel this rule should probably be removed altogether (and honestly, probably the -ne one as well?). This is definitely a huge issue when tagging a full corpus, as it creates problematic tokens.
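
To make the behaviour concrete, here is a minimal sketch of the kind of rule described above; the function and list names are illustrative, not the actual CLTK or pie_extended code:

# Illustrative whitelist of forms that legitimately end in -n; the real list is much longer.
KNOWN_N_FORMS = {"in", "an", "non", "quin", "phison"}

def split_final_n(token):
    # Treat a final -n as the apocope of the enclitic -ne unless the full form is whitelisted.
    if token.lower().endswith("n") and token.lower() not in KNOWN_N_FORMS:
        return [token[:-1], "-ne"]
    return [token]

print(split_final_n("Phison"))  # ['Phison'] -- whitelisted, so left intact
print(split_final_n("Fison"))   # ['Fiso', '-ne'] -- the variant spelling slips through and gets split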

On the other hand, the handling of -que seems to be fine...

Cf. https://github.com/cltk/cltk/pull/970 and https://github.com/cltk/cltk/issues/969

PonteIneptique commented 4 years ago

Current feeling: remove the -ne and -n handling from CLTK territory, and use Collatinus for this part.

pjheslin commented 4 years ago

This seems to be related to the issue I raised in the deucalion-model-lasla repo, where -ne is being detected too often:

habitudi habitudi VER Case=Nom|Numb=Sing habitudi -ne ne2 CONcoo MORPH=empty -ne

PonteIneptique commented 4 years ago

Yes, this comes from the CLTK tokenizer, which I am not really happy with. I currently use a forked version, whose PR is open on their repo ( https://github.com/cltk/cltk/pull/972 ). I don't know if I should change the tokenizer, train pie to recognize enclitics (I might dig into this later this month), or simply continue adding exceptions, which you can do both here https://github.com/hipster-philology/nlp-pie-taggers/blob/1aa0ec2671c734323230da19872f9d5792ed8a40/pie_extended/models/lasla/tokenizer.py#L69 and here https://github.com/hipster-philology/nlp-pie-taggers/blob/1aa0ec2671c734323230da19872f9d5792ed8a40/pie_extended/models/lasla/_params.py#L3

I'll be happy to add new exceptions ;)

pjheslin commented 4 years ago

That's a pretty ugly way to fix the problem. I like your idea of using Collatinus.

In fact, I was wondering: would it make sense to do all of the tagging with Collatinus/LemLat/Morpheus and, when these produce an ambiguous parse, to train an ML model on the Lasla data to disambiguate the parses based on context?

PonteIneptique commented 4 years ago

I thought about it for a long time, but none of the tools cited are good enough across a variety of genres and periods. The pie model is very efficient at scale. Disambiguation is also an issue to some extent, because the LASLA corpus, even though it is big, is restricted to classical Latin.

On top of all that, Collatinus/LemLat/Morpheus do not use the same referentials for morphology and lemmas, and none of them uses the LASLA one, which would mean spending a long time on alignment...

As for the fact that it's pretty ugly, I agree, but it's also technically how Collatinus/LemLat/Morpheus deal with it...

PonteIneptique commented 4 years ago

BTW, the exceptions are computed using the Collatinus decliner I wrote for CLTK :)

from cltk.stem.latin.declension import CollatinusDecliner
import json
import re

decliner = CollatinusDecliner()
known = []
errors = []
for lemma in decliner.__lemmas__:
    try:
        for form, _ in decliner.decline(lemma):
            if form:
                known.append(form)
            else:
                # No inflected form generated: fall back to the bare lemma,
                # stripped of its disambiguating number (e.g. "ne2" -> "ne").
                known.append(re.sub(r"\d+", "", lemma))
    except Exception as exc:
        print(lemma, "Got an error", exc)
        errors.append(lemma)

# Keep only the forms that legitimately end in -n or -ne and dump them as the
# exception list used by the tokenizer.
valid = [tok for tok in sorted(set(known)) if tok.endswith("n") or tok.endswith("ne")]
with open("-ne.json", "w") as f:
    json.dump(valid, f)
print(len(valid), "valid words ending with -ne or -n.")
pjheslin commented 4 years ago

I know it's a big problem that none of the other Latin parsers are based on the same dictionary, so there is no standardized way of referring to a lemma which has different possible entries. I hope maybe the LiLa project will think about creating mappings to a newly-defined standard -- it would fit with the goals of their project.

I'm surprised that LemLat does not work well on later Latin, because they claim to use all of the lemmata in du Cange. But I haven't tested that very much.

PonteIneptique commented 4 years ago

I did in-depth testing a couple of years back; no tool was really impressive. With pie and the big LASLA corpus, we finally have something good for corpora that are normalized, including neo-Latin. I think we are fine with the kind of thing I am doing: basically, automatically listing forms that end with the clitics we handle.

Your issue showed that I was missing some forms, and thanks for that: basically, the CLTK decliner I wrote did not use all the data from Collatinus. I should finish that up early next week :)

PonteIneptique commented 4 years ago

I have potentially a better fix coming for this... But this will take time.

PonteIneptique commented 4 years ago

@pjheslin the model has now been trained to recognize enclitics. It might fail at it, but it should normally be less aggressive than it currently is... Took some time, but it should be better.

PonteIneptique commented 4 years ago

And btw, I made an online index for the lemmata: https://lascivaroma.github.io/forcellini-lemmas/index.html

PonteIneptique commented 4 years ago

Note that you need to update pie_extended AND redownload the model with pie-extended download lasla

pjheslin commented 4 years ago

I've just tested this and enclitic detection works much better now -- thanks a million! (pie-extended didn't work for me with Python 3.8, but was fine with 3.7).

Is Forcellini the dictionary used by Lasla? Or do you feed the Forcellini lemmata into pie-extended directly? That would be useful to know in cases where the lemma is numbered because different words share the same lemma spelling. Did you generate the list of lemmata by running pie-extended on the Lasla corpus? Or on a larger corpus?

Finally, what does it mean when pie-extended puts a question mark after the lemma? Is that an indication of the uncertainty of the parse? Or is it a case where the parser has encountered a lemma not in the Lasla corpus and has to infer its existence? Or something else?

Thanks once again for your work on this!

PonteIneptique commented 4 years ago

Yes, Forcellini is the dictionary used by the LASLA. The list of lemmata is generated from one of their lexica, to which I added a lot of tokens I found in their data but not in the lexicon.

This particular model is trained for post-correction: I decided for now not to perform disambiguation through the DL model but through other means when necessary. The question marks are basically placeholders for that disambiguation :)
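
To make that concrete, here is a minimal sketch of how the rows still carrying the "?" placeholder could be collected from the tagger's output for a later pass; the TSV layout and the "lemma" column name are assumptions on my side, not the exact pie-extended output format:

import csv

def rows_needing_disambiguation(tsv_path):
    # Yield rows whose lemma still ends with the "?" placeholder,
    # i.e. numbered homographs the model did not resolve.
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["lemma"].endswith("?"):
                yield row

# These rows can then be handed to a lexicon- or rule-based second pass.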

pjheslin commented 4 years ago

Thanks for the clarification.

What really interests me about this project is the possibility of automated context-sensitive disambiguation by means of the DL model. Are you planning to implement that? Could you point me to the place in the code where the other means of disambiguation are implemented?

PonteIneptique commented 4 years ago

Reading your message yesterday, I thought it might be good to implement disambiguation as a secondary choice (i.e., make the model choice for Latin optional). As for the automatic disambiguation:

If I remember correctly, only 8% of lemmata share the same POS and lemma spelling :) But I'd need to dive into this again... :)
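
For what it's worth, a rough sketch of how such a figure could be recomputed from a lexicon of numbered lemmata; the (lemma, POS) pair input is an assumption on my side, not the actual LASLA lexicon layout:

import re
from collections import Counter

def share_of_ambiguous_entries(entries):
    # entries: iterable of (lemma, pos) pairs where homographs carry a trailing
    # number, e.g. ("edo1", "VER"). Returns the fraction of entries whose
    # (spelling, POS) pair is shared with at least one other entry.
    keys = [(re.sub(r"\d+$", "", lemma), pos) for lemma, pos in entries]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] > 1) / len(keys)

print(share_of_ambiguous_entries([("edo1", "VER"), ("edo2", "VER"), ("ne", "ADV")]))  # 2/3 here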

pjheslin commented 4 years ago

Thanks. Let me see if I understand correctly. At the moment, pie-extended does do context-sensitive DL-based disambiguation between different forms when they differ in respect of either POS or lemma spelling (or both). When the alternatives are the same POS and lemma spelling (minus the number at the end of the lemma), it does not currently disambiguate, but puts a question mark instead of the number that would tell the difference between two lemmata that are spelled the same. Is that right?

PonteIneptique commented 4 years ago

Exactly :) On top of that, some lemmata get normalized out of the box because LASLA or Forcellini gave them a number (e.g. ___2) but number 1 is never seen in the training data... :)

PonteIneptique commented 4 years ago

Closed by training a model for it.