Wrong Dutch lemmatisation even if not present in training set

BramVanroy commented 3 years ago

I was doing some basic parsing tests and found that a very mundane word was lemmatised incorrectly. The singular form eet of the Dutch verb eten ("to eat") is lemmatised as "emmen" instead of "eten". Below is an example with variations of "I eat like-ADV cookies" (I like to eat cookies).

import stanza

nlp = stanza.Pipeline("nl")

print("Singular forms")
print([w.lemma for w in nlp("Ik eet graag koekjes").sentences[0].words])
print([w.lemma for w in nlp("Jij eet graag koekjes").sentences[0].words])
print([w.lemma for w in nlp("Hij eet graag koekjes").sentences[0].words])
print([w.lemma for w in nlp("Zij eet graag koekjes").sentences[0].words])
print("Plural forms")
print([w.lemma for w in nlp("Wij eten graag koekjes").sentences[0].words])
print([w.lemma for w in nlp("Jullie eten graag koekjes").sentences[0].words])
print([w.lemma for w in nlp("Zij eten graag koekjes").sentences[0].words])
print("Infinitive")
print([w.lemma for w in nlp("Stop met eten").sentences[0].words])

Output:

Singular forms (incorrect lemma)
['ik', 'emmen', 'graag', 'koek']
['jij', 'emmen', 'graag', 'koek']
['hij', 'emmen', 'graag', 'koek']
['zij', 'emmen', 'graag', 'koek']
Plural forms (correct)
['wij', 'eten', 'graag', 'koek']
['jullie', 'eten', 'graag', 'koek']
['zij', 'eten', 'graag', 'koek']
Infinitive
['stoppen', 'met', 'eten']

As you can see, the singular forms are lemmatised incorrectly. To check, I did a quick grep (grep "\seet" nl_alpino-ud-train.conllu.txt) on the Alpino train file, and it does contain the right lemma for this form.

2       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     0       root    0:root  _
2       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     0       root    0:root  _
2       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     0       root    0:root  _
4       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     1       parataxis       1:parataxis     _
2       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     0       root    0:root  _

So I am not sure where this incorrect lemmatisation is coming from.
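The grep check above can also be done a bit more strictly in Python, matching the FORM column exactly rather than any string that starts with "eet". This is only a minimal sketch: the helper name lemma_counts is made up for illustration, and the sample lines are taken from the Alpino sentence quoted later in this thread.

```python
from collections import Counter


def lemma_counts(conllu_text, form):
    """Count the lemmas annotated for an exact word form in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and "# ..." comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # CoNLL-U token lines have 10 tab-separated columns:
        # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
        if len(cols) != 10:
            continue
        if cols[1] == form:
            counts[cols[2]] += 1
    return counts


sample = "\n".join([
    "# text = Snoep verstandig, eet een appel!",
    "1\tSnoep\tsnoepen\tVERB\tWW|pv|tgw|ev\tNumber=Sing|Tense=Pres|VerbForm=Fin\t0\troot\t0:root\t_",
    "4\teet\teten\tVERB\tWW|pv|tgw|ev\tNumber=Sing|Tense=Pres|VerbForm=Fin\t1\tparataxis\t1:parataxis\t_",
])

print(lemma_counts(sample, "eet"))  # prints Counter({'eten': 1})
```

On the real train file, the text would be read with open("nl_alpino-ud-train.conllu.txt", encoding="utf-8").read() and passed in the same way.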

Environment:

OS: Windows
Python version: 3.9.2
Stanza version: 1.2

AngledLuffa commented 3 years ago

Bizarre. In UD 2.7 the label is "eet / emmen", so it's clear the model was doing the "right" thing with apparently wrong data. The strange thing is that in UD 2.8 there are examples of "eet / eten" and none of "eet / emmen", but not all of the examples you found in your grep show up. This one is missing:

# source = Treebank/eans/02_04_04_6.xml
# sent_id = eans\02_04_04_6
# text = Snoep verstandig, eet een appel!
# auto = ALUD2.5.5
1       Snoep   snoepen VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     0       root    0:root  _
2       verstandig      verstandig      ADJ     ADJ|vrij|basis|zonder   Degree=Pos      1       advmod  1:advmod        SpaceAfter=No
3       ,       ,       PUNCT   LET     _       4       punct   4:punct _
4       eet     eten    VERB    WW|pv|tgw|ev    Number=Sing|Tense=Pres|VerbForm=Fin     1       parataxis       1:parataxis     _
5       een     een     DET     LID|onbep|stan|agr      Definite=Ind    6       det     6:det   _
6       appel   appel   NOUN    N|soort|ev|basis|zijd|stan      Gender=Com|Number=Sing  4       obj     4:obj   SpaceAfter=No
7       !       !       PUNCT   LET     _       1       punct   1:punct _

AngledLuffa commented 3 years ago

https://github.com/UniversalDependencies/UD_Dutch-Alpino/issues/5

BramVanroy commented 3 years ago

Isn't that the same as the parataxis one from my grep result? Also, where did you find the "emmen" result? I can't find any with grep "\semmen" nl_alpino-ud-train.conllu.txt.

AngledLuffa commented 3 years ago

There are three unique train files for Dutch that I'm looking at:

- The master branch: the one I pasted here is present in your grep. Weirdly, it was not publicly released in UD 2.8 as far as I can tell.
- UD 2.8 has a subset of the eten results. I will retrain models using that dataset and send them to you later today or tomorrow. Maybe I'll wait until we hear back why there's a discrepancy between master and 2.8.
- UD 2.7 is what the present Stanza release is trained on. It has emmen as the lemma. Presumably it was a data bug.

You can see there's quite a bit of activity in the git repo in the last couple weeks. My guess is one of the "too large" commits includes this particular fix:

https://github.com/UniversalDependencies/UD_Dutch-Alpino/commits/master
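For anyone who wants to see which lemmas changed between two versions of the train file, something like the following might work. It is a rough sketch, not something I ran on the real data: the function names are made up, and the two samples are illustrative lines built by a helper (only the "eet / emmen" vs "eet / eten" labels come from this thread; the other columns are filler).

```python
from collections import defaultdict


def lemmas_by_form(conllu_text):
    """Map each word form to the set of lemmas it is annotated with."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only 10-column token lines; skip comments and blank lines.
        if line.startswith("#") or len(cols) != 10:
            continue
        table[cols[1]].add(cols[2])
    return table


def lemma_changes(old_text, new_text):
    """Return {form: (old_lemmas, new_lemmas)} for forms whose lemmas differ."""
    old, new = lemmas_by_form(old_text), lemmas_by_form(new_text)
    return {
        form: (sorted(old[form]), sorted(new[form]))
        for form in sorted(old.keys() & new.keys())
        if old[form] != new[form]
    }


def token_line(idx, form, lemma):
    # Illustrative token line: everything except FORM and LEMMA is filler.
    return "\t".join([str(idx), form, lemma, "VERB", "_", "_", "0", "root", "_", "_"])


old_sample = "\n".join([token_line(1, "eet", "emmen"), token_line(2, "eten", "eten")])
new_sample = "\n".join([token_line(1, "eet", "eten"), token_line(2, "eten", "eten")])

print(lemma_changes(old_sample, new_sample))  # prints {'eet': (['emmen'], ['eten'])}
```

Forms that appear in only one of the two files are ignored here, since the point is to catch relabelled lemmas rather than added or removed sentences.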

AngledLuffa commented 3 years ago

http://nlp.stanford.edu/~horatio/nl_alpino_tokenizer.pt

Drop that in stanza_resources/nl/tokenize/alpino.pt

then do the same for the following... adjusting the final directory as appropriate

http://nlp.stanford.edu/~horatio/nl_alpino_lemmatizer.pt
http://nlp.stanford.edu/~horatio/nl_alpino_tagger.pt
http://nlp.stanford.edu/~horatio/nl_alpino_parser.pt
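If it helps, the file placement can be sketched in Python. Only the tokenize directory is spelled out above; the lemma, pos and depparse directory names below are assumptions based on Stanza's processor names, so check them against the folders that already exist under stanza_resources/nl before copying anything. The function names are made up for illustration.

```python
import shutil
from pathlib import Path

# ASSUMPTION: only "tokenize" is confirmed in this thread. The other three
# directory names are guesses based on Stanza's processor names.
PROCESSOR_DIRS = {
    "nl_alpino_tokenizer.pt": "tokenize",
    "nl_alpino_lemmatizer.pt": "lemma",
    "nl_alpino_tagger.pt": "pos",
    "nl_alpino_parser.pt": "depparse",
}


def destination(resources_dir, downloaded_name, lang="nl", package="alpino"):
    """Build the target path for a downloaded model file."""
    processor_dir = PROCESSOR_DIRS[downloaded_name]
    return Path(resources_dir) / lang / processor_dir / f"{package}.pt"


def place_models(download_dir, resources_dir):
    """Copy every downloaded model that exists into the resources tree."""
    for name in PROCESSOR_DIRS:
        src = Path(download_dir) / name
        if not src.exists():
            continue  # not downloaded yet, nothing to copy
        dst = destination(resources_dir, name)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


# Show where each file would go, without touching the file system.
for name in PROCESSOR_DIRS:
    print(name, "->", destination("stanza_resources", name))
```

Calling place_models with the download folder and the actual location of stanza_resources would then overwrite the existing alpino.pt files, so keeping a backup of the old models first is probably wise.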

You will also need to install the dev branch of Stanza. Basically this is the 1.2.1 release a week or so early.

https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch

BramVanroy commented 3 years ago

Very interesting to see how this all came to be, through a mistake upstream in the UD repo. Thanks for the super quick replies and fix!!

stanfordnlp/stanza issue #701