stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] German contraction of "an dem" to "am" #1369

Closed: GeorgeS2019 closed this issue 3 months ago

GeorgeS2019 commented 6 months ago

Am

“Am” is a contraction of “an” and “dem”.

An dem

“An dem” is used when you want to keep “an” and “dem” separate for emphasis or clarity.

How does Stanza handle them?

One word "am" with the right word id has TWO more additional words: "an dem"

It is simpler to just parse an int coming back from a word.id. Now, instead of int, it is an array referencing the TWO additional words

The challenges: The parent word has start_char and end_char, but the other morphological features are now transferred to the child word e.g. dem

Question

I wonder how best to handle this when parsing.


{
    "id": [
      10,
      11
    ],
    "text": "am",
    "start_char": 56,
    "end_char": 58,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  },
  {
    "id": 10,
    "text": "an",
    "lemma": "an",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 12,
    "deprel": "case"
  },
  {
    "id": 11,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 12,
    "deprel": "det"
  }
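For reference, a minimal sketch of how output like the above can be produced. The input sentence here is an assumed example, not the exact text behind the offsets shown:

import stanza

# Assumed example input; any German sentence containing "am" will do.
nlp = stanza.Pipeline("de", processors="tokenize,mwt,pos,lemma,depparse,ner")
doc = nlp("Der Firma liegt genau am Ortseingang.")

for sentence in doc.to_dict():
    for entry in sentence:
        if isinstance(entry["id"], int):
            print("word: ", entry)
        else:
            # Multi-word token: "id" is a range such as [10, 11]; it carries
            # text/start_char/end_char, while the analysis lives on the word
            # entries that follow it.
            print("token:", entry)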
GeorgeS2019 commented 6 months ago

I doubt spaCy would handle it this way, I am simply curious

AngledLuffa commented 6 months ago

This is a complicated question which comes up frequently, and people never seem to like the answer. However, my impression is probably subject to the same survivorship bias as the bullet holes in planes: only the people who don't like the answer show up on GitHub.

This is what CoreNLP does:

edit: this whole German CoreNLP section was done with the wrong annotation pipeline, see below

NLP> Der Firma liegt genau am Ortseingang.

Sentence #1 (7 tokens):
Der Firma liegt genau am Ortseingang.

Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=Der NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP Lemma=Firma NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=NN Lemma=liegt NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=NN Lemma=genau NamedEntityTag=O]
[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NNP Lemma=Ortseingang NamedEntityTag=PERSON]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=. Lemma=. NamedEntityTag=O]

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, Ortseingang-6)
compound(Firma-2, Der-1)
compound(genau-4, Firma-2)
compound(genau-4, liegt-3)
nsubj(Ortseingang-6, genau-4)
cop(Ortseingang-6, am-5)
punct(Ortseingang-6, .-7)

The original training data in the UD treebank was

# sent_id = train-s25
# text = Der Firma liegt genau am Ortseingang.
1       Der     der     DET     ART     Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      2       det     _       _
2       Firma   Firma   NOUN    NN      Case=Nom|Gender=Masc|Number=Sing        3       nsubj   _       _
3       liegt   liegen  VERB    VVFIN   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       _
4       genau   genau   ADV     ADV     _       7       advmod  _       _
5-6     am      _       _       _       _       _       _       _       _
5       an      an      ADP     APPR    _       7       case    _       _
6       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      7       det     _       _
7       Ortseingang     Ortseingang     NOUN    NN      Case=Dat|Gender=Masc|Number=Sing        3       obl     _       SpaceAfter=No
8       .       .       PUNCT   $.      _       3       punct   _       _

The thing with the CoreNLP representation is, am is not a copular verb as far as I know. Google translate says it means "at the". Also, it's completely missing that the liegt is the verb. Basically that representation sucks.

The problem is that am in fact represents two words at the same time - the adposition and the determiner. If you just implement one tag for the entire token, probably the adposition, leaving out the determiner, that would be a little weird. Even more awkward would be a combination tag of some kind (although to be fair some datasets have adopted that approach, such as the Korean UD treebanks)

The solution UD adopted for most languages is to represent the text as a single token, am in this case, and split the analysis into the two words, an and dem. It is true there are some inconveniences here as well, such as an does not correspond to an actual start & end character. However, it makes analysis of words such as am much easier, since now you can analyze both words that it represents in a proper manner.

This happens in other languages. In Spanish, the pronoun clitics get split from verbs - otherwise you'd have 10x as many verbs to analyze. In English, there is the entire class of possessives, standard contractions such as can't, won't, it's, and colloquial contractions such as cannot, gonna, wanna. Then at the edges you can have 20-response-long threads on UD about kinda or mighta as possible additions to the splittable lexicon... (These kinds of threads alternate between amusing me every time I kick one off and discouraging me from asking in the first place about the best way for our software to analyze specific text)

Long story short, if all you want is the analysis of the pieces, you can either filter out from the json / dict representation any token whose id isn't just an int, or you can use doc.sentences[idx].words instead of the dict representation. That might be a little unsatisfying since it won't have character offsets in a language such as German, where the MWT don't split into easily understood pieces (compare to English, where we split cannot -> can not... how would you split am as text?). The Word objects each have a pointer to the enclosing Token, though, and the Token does have the start_char and end_char for the entire piece of text.
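As a concrete sketch of those two options (assuming doc came from a German Stanza pipeline run on the example sentence):

# Option 1: keep only entries whose id is a plain int, i.e. drop the
# multi-word token entries from the dict representation.
words_only = [entry
              for sentence in doc.to_dict()
              for entry in sentence
              if isinstance(entry["id"], int)]

# Option 2: iterate Word objects and reach character offsets through the
# enclosing Token, via word.parent.
for word in doc.sentences[0].words:
    token = word.parent  # the Token containing this Word, e.g. "am" for "an"
    print(word.text, word.upos, token.start_char, token.end_char)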

As for spacy, it does

>>> doc = nlp("I don't know what spacy does with MWT")
>>> for token in doc:
...     print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
do AUX aux
n't PART neg
know VERB ROOT
what PRON det
spacy NOUN nsubj
does VERB ccomp
with ADP prep
MWT PROPN pobj

>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
...     print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna VERB ROOT
lick PROPN compound
Jennifer PROPN poss
's PART case
antennae NOUN dobj

>>> nlp = spacy.load('en_core_web_trf')
>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
...    print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna AUX aux
lick VERB ROOT
Jennifer PROPN poss
's PART case
antennae NOUN dobj

>>> nlp = spacy.load('de_dep_news_trf')
>>> doc = nlp("Der Firma liegt genau am Ortseingang.")
>>> for token in doc:
...    print(token.text, token.pos_, token.dep_)
...
Der DET nk
Firma NOUN da
liegt VERB ROOT
genau ADV mo
am ADP mo
Ortseingang NOUN nk
. PUNCT punct

So they are treating contractions as single words (although they do split clitics). IDK, maybe people prefer that representation

GeorgeS2019 commented 6 months ago

First

thx for taking the time to provide such an elaborate answer.

Many top tech companies are using Stanza and CoreNLP. I saw the same mistake and am here to give feedback.

German is no doubt a very challenging language.

I am here to learn and give feedback :-)

CoreNLP: lemma of "am" => "be"??

[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]

==> VBP is unfortunately not correct.

From ChatGPT

The lemma of “am” would be “an” and "dem"

UD Treebank

Correct!

5-6     am      _       _       _       _       _       _       _       _
5       an      an      ADP     APPR    _       7       case    _       _
6       dem     der     DET     ART     Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art      7       det     _       _

an => ADP (preposition), dem => DET (determiner)

Spacy

"an dem" => "an" is a preposition and "dem" is a determinant article in Dative form.

Therefore ADP (Preposition) is correct for "am"

GeorgeS2019 commented 6 months ago

thx for the tips on how to parse. Really helpful.

AngledLuffa commented 6 months ago

A ha ha I accidentally used the English CoreNLP instead of German. Let me revise...

NLP> Der Firma liegt genau am Ortseingang.

Sentence #1 (8 tokens):
Der Firma liegt genau am Ortseingang.

Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DET NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VERB NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=ADV NamedEntityTag=O]
[Text=an CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=ADP NamedEntityTag=O]
[Text=dem CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=DET NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=PUNCT NamedEntityTag=O]

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, liegt-3)
det(Firma-2, Der-1)
nsubj(liegt-3, Firma-2)
advmod(Ortseingang-7, genau-4)
case(Ortseingang-7, an-5)
det(Ortseingang-7, dem-6)
obl:an(liegt-3, Ortseingang-7)
punct(liegt-3, .-8)

Okay, that's much better. It also splits am, then labels the start and end characters as the same (overlapping) text positions as the original word. So effectively it's the same design choice as made in Stanza, but without an explicit marker that it was a multi-word token.
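In Stanza that marker is the token's range id; a quick sketch of spotting MWTs explicitly (assuming doc from the German pipeline above):

for token in doc.sentences[0].tokens:
    if len(token.words) > 1:
        # An explicit multi-word token: one surface span, several words.
        print(f"MWT '{token.text}' [{token.start_char}:{token.end_char}] ->",
              [(word.text, word.upos) for word in token.words])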

GeorgeS2019 commented 6 months ago

I have yet to appreciate the benefits of treating it as a multi-word token.

So far, I only know a very limited number of languages.

GeorgeS2019 commented 6 months ago

Stanza

I could be doing it wrong.

I doubt I get start_char and end_char for "an" and "dem"

for word in doc.sentences[idx].words:
    # no start_char / end_char on the word itself for "an" and "dem"
    ...
{
    "id": [
      10,
      11
    ],
    "text": "am",
    "start_char": 56,
    "end_char": 58,
    "ner": "O",
    "multi_ner": [
      "O"
    ]
  },
  {
    "id": 10,
    "text": "an",
    "lemma": "an",
    "upos": "ADP",
    "xpos": "APPR",
    "head": 12,
    "deprel": "case"
  },
  {
    "id": 11,
    "text": "dem",
    "lemma": "der",
    "upos": "DET",
    "xpos": "ART",
    "feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
    "head": 12,
    "deprel": "det"
  }
AngledLuffa commented 6 months ago

True true. But what you can do is

>>> doc = pipe("Der Firma liegt genau am Ortseingang.")
>>> doc.sentences[0].words[4]
{
  "id": 5,
  "text": "an"
}
>>> doc.sentences[0].words[4].parent
[
  {
    "id": [
      5,
      6
    ],
    "text": "am",
    "start_char": 22,
    "end_char": 24
  },
  {
    "id": 5,
    "text": "an"
  },
  {
    "id": 6,
    "text": "dem"
  }
]
>>> doc.sentences[0].words[4].parent.start_char
22
>>> doc.sentences[0].words[4].parent.end_char
24
GeorgeS2019 commented 6 months ago

Thank you

valuable tip!!!

GeorgeS2019 commented 6 months ago

What could cause the wrong POS of "miaut" in "Der Hund bellt, die Katze miaut."?

"miaut" is not a verb in stanza. I am curious how this could happen.

AngledLuffa commented 6 months ago

It is a verb if you use the default_accurate models. That has the more accurate constituency parser, anyway, so I would suggest doing that if accurate constituency parses are desired

As for the root cause, that word doesn't show up in the training data, so all it has to go on are the embeddings and the context of the sentence. Sometimes it will get such a thing wrong
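A minimal sketch of switching to those models; the package name follows the suggestion above, and the one-time download step is assumed:

import stanza

# Assumed one-time download of the larger, more accurate German models.
stanza.download("de", package="default_accurate")

nlp = stanza.Pipeline("de", package="default_accurate")
doc = nlp("Der Hund bellt, die Katze miaut.")
print([(word.text, word.upos)
       for sentence in doc.sentences
       for word in sentence.words])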

GeorgeS2019 commented 6 months ago

What I have learned over the last few weeks is that one may need to go deeper into the source and into how the training is done. Each approach seems to have more success with one case, while another is better with a different case. I see in many ways the merits of how Stanza approaches the subject.