stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

German lemmatizer performance is bad? #1382

Open Brentably opened 5 months ago

Brentably commented 5 months ago

Hello! I'm currently trying to use Stanza's German lemmatizer for a project I'm working on. As far as I understand, this should be on par with the most accurate publicly available lemmatizers out there, if not the most accurate.

However, I'm really confused by the poor German performance. I get the following results when lemmatizing:

möchtest => möchtessen (should be mögen)
Willst => Willst (should be wollen)
sagst => sagst (should be sagen)
Sage => Sage (should be sagen)
aß => aß (should be essen)
Sprich => Sprich (should be sprechen)

These are all among the top ~50 verbs in German, and none of these inflections are particularly rare, so I'm really confused by the performance. I recently did some digging and found that the HDT model should be more accurate, and it is, but the results are still unimpressive:

möchtest => möchtes (should be mögen)
Willst => Willst (should be wollen)
sagst => sagsen (should be sagen)
Sage => sagen (correct)
aß => assen (should be essen)
Sprich => sprechen (correct)

This gets 2/6 correct instead of 0/6, but of course that's still really poor.
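For reference, this is roughly how I'm calling the pipeline, comparing the default model with HDT (a minimal sketch; I'm assuming "gsd" and "hdt" are the right package names for the installed German models):

import stanza

words = ["möchtest", "Willst", "sagst", "Sage", "aß", "Sprich"]

# Compare the default GSD model against the HDT model.
for package in ("gsd", "hdt"):
    stanza.download("de", package=package)
    nlp = stanza.Pipeline("de", package=package, processors="tokenize,mwt,pos,lemma")
    for text in words:
        doc = nlp(text)
        for sent in doc.sentences:
            for word in sent.words:
                print(package, word.text, "=>", word.lemma)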

I recently found the website Cooljugator (https://cooljugator.com/de); you can look up a verb, either conjugated or in the infinitive, and it seems to have near-perfect performance for all of these.

Can anyone explain or point me in the right direction?

I'm considering getting a bunch of data and trying to supplement performance with my own lookup table right now, but would rather not spend the few days of effort that would require.
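Roughly what I have in mind, as a sketch (the override table here is hypothetical and hand-entered, not anything that ships with Stanza; a real version would be built from scraped data and could also key on the POS tag):

import stanza

# Hypothetical hand-built override table: lowercased surface form -> lemma.
OVERRIDES = {
    "möchtest": "mögen",
    "willst": "wollen",
    "sagst": "sagen",
    "aß": "essen",
}

nlp = stanza.Pipeline("de", processors="tokenize,mwt,pos,lemma")

def lemmatize(text):
    # Run the normal pipeline, then patch any lemma found in the override table.
    doc = nlp(text)
    for sent in doc.sentences:
        for word in sent.words:
            word.lemma = OVERRIDES.get(word.text.lower(), word.lemma)
    return doc

for word in lemmatize("Willst du das sagen?").sentences[0].words:
    print(word.text, "=>", word.lemma)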

Thanks!

AngledLuffa commented 5 months ago

Main issue is that the training data just doesn't have those verbs in them. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have that AFAIK. I can do some digging for that if you don't have suggestions.

One example which shows up in the training data with a different result is Sage. In each of the following sentences, the GSD training data has Sage -> Sage:

# text = Der Sage nach wurden die Nelken 1270 vom Heer des französischen Königs Ludwig IX.
# text = Die Sage, deren historischer Gehalt nicht zu sichern ist, hat insofern ätiologische Funktion.
# text = In den 1920er Jahren hatte er Kontakt mit Cornelia Bentley Sage Quinton, die als erste Frau in den USA ein größeres Kunstmuseum leitete.

One thought which occurs to me is that maybe the lemmatizer's model should have some input based on the POS tag given, whereas it currently doesn't use the POS except for the dictionary lookup. I wonder if that would help in terms of lemmatizing unknown words.
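As a rough illustration of what I mean (a hypothetical sketch, not the actual Stanza lemmatizer code): the POS tag could be embedded and concatenated to each character embedding before the seq2seq encoder, giving the model a signal about whether to produce a verb-style or noun-style ending for an unknown word.

import torch
import torch.nn as nn

# Hypothetical POS-conditioned character encoder for a seq2seq lemmatizer.
class CharEncoderWithPOS(nn.Module):
    def __init__(self, n_chars, n_pos, char_dim=64, pos_dim=16, hidden_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        self.encoder = nn.LSTM(char_dim + pos_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids, pos_ids):
        # char_ids: (batch, word_len) character indices of the surface form
        # pos_ids:  (batch,) one POS index per word
        chars = self.char_emb(char_ids)
        pos = self.pos_emb(pos_ids).unsqueeze(1).expand(-1, char_ids.size(1), -1)
        return self.encoder(torch.cat([chars, pos], dim=-1))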

Brentably commented 5 months ago

Main issue is that the training data just doesn't have those verbs in them. If we had some kind of lexicon available with expected lemmas, we could include that, but we don't have that AFAIK. I can do some digging for that if you don't have suggestions.

You mean like some better lookup data? TBH I was just going to scrape some stuff, but would be happy to send it along.

Also, pardon my naiveté, but I'm just generally confused. Isn't this state of the art for lemmatizers? Are the best lemmatizers all closed source and made in-house, or are there just not that many non-English applications that depend on a lemmatizer? Is there another popular solution to this problem that I'm not aware of?

AngledLuffa commented 5 months ago

The performance was measured on the test portions of the datasets, so to the extent those are limited and don't really cover some important concepts, the test scores will also reflect that.

I don't know what the best German lemmatizer is, but I can take some time later this week, or in a chat with my PI, to figure out other sources of training data. I also think embedding the POS tags in the seq2seq model will likely help it know whether to use a verb-style or noun-style ending for unknown words in a language such as German.

AngledLuffa commented 5 months ago

options for additional training data, from @manning

I think the two main choices are:

https://github.com/Liebeck/IWNLP.Lemmatizer (uses Wikidict, probably good for future)
https://github.com/WZBSocialScienceCenter/germalemma (says unmaintained)

I also have high hopes for using the POS as an input embedding to the seq2seq at least helping, but @manning points out that there are a lot of irregulars in German which may or may not be helped by such an approach.

I don't expect to get to this in the next couple days, but perhaps next week or so I can start in on it

Brentably commented 5 months ago

I scraped ~5000 words of data from a conjugation/declension website. The data seems to be high quality.

AngledLuffa commented 5 months ago

That does sound like it could be a useful resource!

Brentably commented 5 months ago

Sent you an email!

AngledLuffa commented 1 month ago

I started going through the lemma sheet you sent, thinking we could add that as a new lemmatizer model in the next version. (Which will hopefully be soon.)

One thing I came across in my investigation is a weirdness in the GSD lemmas for some words, but not all:

https://github.com/UniversalDependencies/UD_German-GSD/issues/35

I also found some inconsistencies in the JSON you'd sent us. (Was that script written in TypeScript?)

So, for example, early on, words that translate as "few" and "at least" are included under the same lemma:

{
    "word": "wenig",
    "pos": "adj",
    "versions": [
      "weniger",
      "wenigen",
      "wenigem",
      "wenige",
      "weniges",
      "wenig",
      "minder",
      "mindesten"
    ]
  },

wenig and mindesten translate differently on Google Translate, and mindesten is treated as its own lemma in GSD.

Also treated differently in GSD: welches -> welcher, not welch, and the pos is DET

33      welches welcher DET     PRELS   Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel       37      obj     _       _

 {
    "word": "welch, -e, -er, -es",
    "pos": "pron",
    "versions": ["welch", "welche", "welcher", "welches", "welchen", "welchem"]
  },

There are some unusual POS values in the data you sent us:

The POS here should presumably be NOUN for Mann, Mannes, Männer, Männern, but the data gives the article instead:

{
    "word": "Mann",
    "pos": "der",
    "versions": ["Mann", "Mannes", "Manns", "Manne", "Männer", "Männern"]
  },

Also a noun with the article as its POS:

{
    "word": "Kind",
    "pos": "das",
    "versions": ["Kind", "Kindes", "Kinds", "Kinde", "Kinder", "Kindern"]
  },

Ambiguous POS values like this are hard for us to resolve in an automated fashion:

{
    "word": "kein",
    "pos": "pron/art",
    "versions": ["kein", "keines", "keine", "keinem", "keinen", "keiner"]
  },

Not sure what to do with entries like these:

  { "word": "nichts, nix", "pos": "pron", "versions": ["nichts", "nix"] },
  { "word": "nun, nu", "pos": "adv", "versions": ["nun", "nu"] },

Another example of a POS that isn't a UPOS:

  { "word": "Frage", "pos": "die", "versions": ["Frage", "Fragen"] },
  { "word": "Hand", "pos": "die", "versions": ["Hand", "Hände", "Händen"] },

If you can resolve these or suggest how to resolve them, we can include this in the lemmatizer. Adding a long list of verb, noun, and adjective conjugations and declensions would certainly be quite useful for avoiding future German lemmatizer mistakes.
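For context, what we would want to extract from the JSON is a list of (surface form, UPOS, lemma) triples. A rough sketch of the kind of conversion I have in mind (the POS mapping here is my own guess, and ambiguous entries such as "pron/art" are simply skipped rather than resolved):

import json

# Hypothetical mapping from the POS values in the sheet to UPOS.
POS_TO_UPOS = {
    "der": "NOUN", "die": "NOUN", "das": "NOUN",
    "adj": "ADJ", "adv": "ADV", "verb": "VERB", "pron": "PRON",
}

def to_triples(path):
    with open(path, encoding="utf-8") as fin:
        entries = json.load(fin)
    triples = []
    for entry in entries:
        upos = POS_TO_UPOS.get(entry["pos"])
        if upos is None:
            continue  # skip ambiguous or unrecognized POS values
        # "welch, -e, -er, -es" -> "welch"; plain headwords pass through unchanged
        lemma = entry["word"].split(",")[0].strip()
        for form in entry["versions"]:
            triples.append((form, upos, lemma))
    return triples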