Closed BramVanroy closed 3 years ago
Bizarre. In version 2.7 the label is "eet / emmen", so it's clear the model was doing the "right" thing with apparently wrong data. The strange thing is that in version 2.8 there are examples of "eet / eten" and not "eet / emmen", but not all of the examples you found in your grep show up. This one is missing:
# source = Treebank/eans/02_04_04_6.xml
# sent_id = eans\02_04_04_6
# text = Snoep verstandig, eet een appel!
# auto = ALUD2.5.5
1 Snoep snoepen VERB WW|pv|tgw|ev Number=Sing|Tense=Pres|VerbForm=Fin 0 root 0:root _
2 verstandig verstandig ADJ ADJ|vrij|basis|zonder Degree=Pos 1 advmod 1:advmod SpaceAfter=No
3 , , PUNCT LET _ 4 punct 4:punct _
4 eet eten VERB WW|pv|tgw|ev Number=Sing|Tense=Pres|VerbForm=Fin 1 parataxis 1:parataxis _
5 een een DET LID|onbep|stan|agr Definite=Ind 6 det 6:det _
6 appel appel NOUN N|soort|ev|basis|zijd|stan Gender=Com|Number=Sing 4 obj 4:obj SpaceAfter=No
7 ! ! PUNCT LET _ 1 punct 1:punct _
Isn't that the same as the parataxis one from my grep result? Also, where did you find the "emmen" result? I can't find any with grep "\semmen" nl_alpino-ud-train.conllu.txt.
There are three unique train files for Dutch that I'm looking at:
"eten" results. I will retrain models using that dataset and send them to you later today or tomorrow. Maybe I'll wait until we hear back why there's a discrepancy between master and 2.8.
"emmen" as the lemma. Presumably it was a data bug. You can see there's quite a bit of activity in the git repo in the last couple of weeks. My guess is that one of the "too large" commits includes this particular fix:
https://github.com/UniversalDependencies/UD_Dutch-Alpino/commits/master
http://nlp.stanford.edu/~horatio/nl_alpino_tokenizer.pt
Drop that in stanza_resources/nl/tokenize/alpino.pt, then do the same for the following, adjusting the final directory as appropriate:
http://nlp.stanford.edu/~horatio/nl_alpino_lemmatizer.pt
http://nlp.stanford.edu/~horatio/nl_alpino_tagger.pt
http://nlp.stanford.edu/~horatio/nl_alpino_parser.pt
You will also need to install the dev branch of stanza. Basically this is the 1.2.1 release a week or so early.
https://stackoverflow.com/questions/20101834/pip-install-from-git-repo-branch
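For anyone who prefers to script the manual download steps above, here is a rough sketch. The file names and base URL come from the links in this thread. The sub-directory names `lemma`, `pos` and `depparse` are my assumption about how stanza lays out its resources directory (only `tokenize` is stated above), and `plan_downloads` / `fetch_models` are made-up helper names, not part of stanza.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "http://nlp.stanford.edu/~horatio/"

# Model file name -> processor sub-directory under stanza_resources/nl/.
# Only "tokenize" is confirmed in this thread; the other three directory
# names are an assumption about stanza's resource layout.
MODEL_DIRS = {
    "nl_alpino_tokenizer.pt": "tokenize",
    "nl_alpino_lemmatizer.pt": "lemma",
    "nl_alpino_tagger.pt": "pos",
    "nl_alpino_parser.pt": "depparse",
}


def plan_downloads(resources_dir):
    """Return (url, target_path) pairs without touching the network."""
    root = Path(resources_dir)
    return [
        (BASE_URL + name, root / "nl" / subdir / "alpino.pt")
        for name, subdir in MODEL_DIRS.items()
    ]


def fetch_models(resources_dir):
    """Download every model to its target path, creating directories as needed."""
    for url, target in plan_downloads(resources_dir):
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, target)
```

Calling `fetch_models(Path.home() / "stanza_resources")` would then place all four files, assuming that is where your stanza resources live. Printing the result of `plan_downloads(...)` first is a cheap way to check the target paths before anything is overwritten.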
Very interesting to see how this all came to be, through a mistake up in the UD repo. Thanks for the super quick replies and fix!!
I was doing some basic parsing tests and found that a very mundane word was lemmatised incorrectly. The Dutch word eten ("to eat") is incorrectly lemmatised as "emmen" when given in its singular form. Below is an example with variations of "I eat like-ADV cookies" (I like to eat cookies).
Output:
As you can see, the singular forms are lemmatised incorrectly. To check, I did a quick grep (grep "\seet" nl_alpino-ud-train.conllu.txt) on the Alpino train file, and it does contain the right form. So I am not sure where this incorrect lemmatisation is coming from.
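A column-aware alternative to the grep above is sketched below. It only assumes the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, ...). The function name `lemma_counts` is made up for illustration. It tallies which lemmas a given surface form is annotated with, so a stray lemma such as "emmen" would show up next to "eten".

```python
from collections import Counter


def lemma_counts(lines, form):
    """Count the lemmas assigned to `form` (case-insensitive) in CoNLL-U lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and "# ..." comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        token_id, token_form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        if token_form.lower() == form.lower():
            counts[lemma] += 1
    return counts
```

To run it on the train file, open `nl_alpino-ud-train.conllu.txt` with `encoding="utf-8"` and pass the file object as `lines`, with `"eet"` as `form`.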
Environment: