ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Bad performance on swedish #28

Closed EmilStenstrom closed 7 years ago

EmilStenstrom commented 7 years ago

Hi. I'm switching an old project that parses Swedish sentences from a custom parser to UDPipe. But when I compare the results for simple sentences, the UDPipe output is pretty bad.

I'm using this example sentence to show the difference: "Hitta ordklass i svensk text". I've added exclamation marks where results are incorrect.

| Swe word | Eng trans | upos | features |
|---|---|---|---|
| Hitta | Find | VERB | Mood=Imp❗️, VerbForm=Fin, Voice=Act |
| ordklass | word class | PRON❗️ | Case=Acc, Definite=Def, Gender=Com, Number=Plur❗️ |
| i | in | ADP | - |
| svensk | Swedish | ADJ | Case=Nom, Definite=Ind, Degree=Pos, Gender=Com, Number=Sing |
| text | text | NOUN | Case=Nom, Definite=Ind, Gender=Com, Number=Sing |

The other tagger handles all of the above examples correctly. Is it because the architecture is different (structured perceptron using greedy search for decoding), or because they use a larger corpus?
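For anyone who wants to reproduce this kind of comparison, here is a minimal sketch that reads tagger output in CoNLL-U format and lists the tokens whose UPOS differs from an expected tag. The helper names (`conllu_line`, `read_tags`, `upos_mismatches`) are made up for illustration. The token lines are rebuilt from the table above, with every column the table does not show left as `_`, and the expected tags are written by hand.

```python
def conllu_line(idx, form, upos, feats="_"):
    """Build one 10-column CoNLL-U token line; columns we don't know are '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", feats, "_", "_", "_", "_"])


def read_tags(conllu_text):
    """Return a list of (form, upos, feats) tuples, one per word token."""
    rows = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        rows.append((cols[1], cols[3], cols[5]))
    return rows


def upos_mismatches(rows, expected_upos):
    """Return (form, got, expected) for every token whose UPOS differs."""
    return [
        (form, upos, exp)
        for (form, upos, _), exp in zip(rows, expected_upos)
        if upos != exp
    ]


# UDPipe's tags as reported in the table above (features joined with '|').
udpipe_output = "\n".join([
    "# text = Hitta ordklass i svensk text",
    conllu_line(1, "Hitta", "VERB", "Mood=Imp|VerbForm=Fin|Voice=Act"),
    conllu_line(2, "ordklass", "PRON", "Case=Acc|Definite=Def|Gender=Com|Number=Plur"),
    conllu_line(3, "i", "ADP"),
    conllu_line(4, "svensk", "ADJ", "Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing"),
    conllu_line(5, "text", "NOUN", "Case=Nom|Definite=Ind|Gender=Com|Number=Sing"),
])

# Hand-written expected tags for the example sentence (an assumption, not tool output).
expected = ["VERB", "NOUN", "ADP", "ADJ", "NOUN"]

rows = read_tags(udpipe_output)
print(upos_mismatches(rows, expected))  # [('ordklass', 'PRON', 'NOUN')]
```

The sketch only compares UPOS; comparing features the same way would require the other tagger's full output, which is not listed here.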

martinpopel commented 7 years ago

UDPipe pretrained models are trained on the UD data only; they do not have access to any (Swedish morphological) lexicon (while efselab's documentation mentions a lexicon-based lemmatizer). If needed, MorphoDiTa (used in UDPipe) can be trained with an additional morphological lexicon. Also, I guess the new version of UDPipe will use better morphological analysis (NN-based, character-level embeddings).

As for your last question, the main cause seems to be the larger corpus: the Stockholm-Umeå Corpus contains the word "ordklass" tagged as a noun, while UD_Swedish does not contain this word.
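A check like this can be sketched in a few lines: count which UPOS tags a given word form receives in a CoNLL-U file. The function name `upos_counts` and the two-token sample are made up for illustration; the sample is not taken from UD_Swedish.

```python
from collections import Counter


def upos_counts(conllu_lines, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U lines."""
    counts = Counter()
    target = form.lower()
    for line in conllu_lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only regular word tokens: 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[1].lower() == target:
            counts[cols[3]] += 1
    return counts


# Made-up snippet, only to show the call; a real run would iterate over a treebank file.
sample = [
    "# text = svensk text\n",
    "1\tsvensk\t_\tADJ\t_\t_\t_\t_\t_\t_\n",
    "2\ttext\t_\tNOUN\t_\t_\t_\t_\t_\t_\n",
]

print(upos_counts(sample, "text"))      # Counter({'NOUN': 1})
print(upos_counts(sample, "ordklass"))  # Counter()
```

Since the function accepts any iterable of lines, an open file handle for a treebank's training file can be passed directly. An empty counter means the tagger never saw the word during training and has to guess its tag from context and word shape.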

Regarding "Hitta": what should the mood be, if not imperative? Is the English translation correct? Is the sentence grammatical if it starts with a non-imperative verb and does not contain a subject?

EmilStenstrom commented 7 years ago

Thank you @martinpopel, that explains it.

You are right that Mood=Imp is probably correct. The fact that the feature is missing is probably a mistake in the other tagger. Looking forward to improvements here in the future, but I understand that this is a hard problem.

EmilStenstrom commented 7 years ago

One more question: are the pretrained models trained on both of the Swedish datasets or just one of them?

foxik commented 7 years ago

I assume you are still using models based on UD 1.2 -- at that time there was only one Swedish corpus.

We have models trained on UD 2.0 where we have separate models for both Swedish corpora. The problem with training a model on multiple corpora is that sometimes the corpora do not have consistent annotation (for example, UD_Swedish has lemmas and UD_Swedish-LinES does not), which the current UDPipe cannot handle, but we are planning to improve that.
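One rough way to spot this kind of inconsistency before merging corpora is to measure how many tokens in each CoNLL-U file have the LEMMA column filled in. This is only a sketch: `lemma_coverage` is a made-up helper, not part of UDPipe, and the one-token inputs are illustrative.

```python
def lemma_coverage(conllu_lines):
    """Return the fraction of word tokens whose LEMMA column is not '_'."""
    total = with_lemma = 0
    for line in conllu_lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only regular word tokens: 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        total += 1
        # '_' marks an unspecified value (a literal underscore token is a rare exception).
        if cols[2] != "_":
            with_lemma += 1
    return with_lemma / total if total else 0.0


# Illustrative one-token inputs: one with a lemma, one without.
lemmatized = ["1\ttext\ttext\tNOUN\t_\t_\t_\t_\t_\t_\n"]
unlemmatized = ["1\ttext\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"]

print(lemma_coverage(lemmatized))    # 1.0
print(lemma_coverage(unlemmatized))  # 0.0
```

A large gap in coverage between two treebanks suggests that they should not be concatenated as-is for training a lemmatizer.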

martinpopel commented 7 years ago

See UDv1.2 models and the newly released UDv2.0 models.

foxik commented 7 years ago

Also note that the UDv2.0 models require UDPipe 1.1, which has not yet been officially released (it lives in the pre1.1 branch), but should be soon.

EmilStenstrom commented 7 years ago

Ah, thanks for clarifying. I actually saw that there were other models released, I just didn't realize that they were general, and not JUST for the CoNLL 2017 Shared Task. Looking forward to the 1.1 release! :)

EmilStenstrom commented 7 years ago

Just wanted to report back:

As you suspected, with UD-2.0 the strangeness with "ordklass" being a PRON is gone. Now it's correctly identified as a NOUN. Thanks again for your hard work, you are really making a difference for multi-language NLP.

foxik commented 7 years ago

Thanks, that is nice to hear :-)