Closed: GeorgeS2019 closed this 3 months ago
I doubt spaCy would handle it this way; I am simply curious.
This is a complicated question which comes up frequently, and people never seem to like the answer. However, my impression of that is probably subject to the same survivorship bias as the bullet holes in planes - only the people who don't like the answer show up on GitHub.
This is what CoreNLP does:
edit: this whole German CoreNLP section was done with the wrong annotation pipeline, see below
NLP> Der Firma liegt genau am Ortseingang.
Sentence #1 (7 tokens):
Der Firma liegt genau am Ortseingang.
Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=Der NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP Lemma=Firma NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=NN Lemma=liegt NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=NN Lemma=genau NamedEntityTag=O]
[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NNP Lemma=Ortseingang NamedEntityTag=PERSON]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=. Lemma=. NamedEntityTag=O]
Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, Ortseingang-6)
compound(Firma-2, Der-1)
compound(genau-4, Firma-2)
compound(genau-4, liegt-3)
nsubj(Ortseingang-6, genau-4)
cop(Ortseingang-6, am-5)
punct(Ortseingang-6, .-7)
The original training data in the UD treebank was
# sent_id = train-s25
# text = Der Firma liegt genau am Ortseingang.
1 Der der DET ART Case=Nom|Definite=Def|Gender=Masc|Number=Sing|PronType=Art 2 det _ _
2 Firma Firma NOUN NN Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ _
3 liegt liegen VERB VVFIN Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
4 genau genau ADV ADV _ 7 advmod _ _
5-6 am _ _ _ _ _ _ _ _
5 an an ADP APPR _ 7 case _ _
6 dem der DET ART Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _
7 Ortseingang Ortseingang NOUN NN Case=Dat|Gender=Masc|Number=Sing 3 obl _ SpaceAfter=No
8 . . PUNCT $. _ 3 punct _ _
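For anyone parsing this format by hand: the id column is what distinguishes the surface token from the syntactic words. Here is a minimal sketch in plain Python, using a simplified four-column version of the lines above rather than the full ten-column CoNLL-U format:

```python
# Simplified CoNLL-U id-column parser: a range id like "5-6" marks a
# multi-word token (the surface form only), plain integer ids mark
# the syntactic words.  Four columns kept for brevity: id, form,
# lemma, upos.
conllu = """\
1\tDer\tder\tDET
2\tFirma\tFirma\tNOUN
3\tliegt\tliegen\tVERB
4\tgenau\tgenau\tADV
5-6\tam\t_\t_
5\tan\tan\tADP
6\tdem\tder\tDET
7\tOrtseingang\tOrtseingang\tNOUN
8\t.\t.\tPUNCT"""

mwt_ranges = {}   # surface form -> (first word id, last word id)
words = []        # (id, form, lemma, upos) for syntactic words only

for line in conllu.splitlines():
    idx, form, lemma, upos = line.split("\t")
    if "-" in idx:                       # e.g. "5-6": surface token line
        start, end = map(int, idx.split("-"))
        mwt_ranges[form] = (start, end)
    else:                                # a real syntactic word
        words.append((int(idx), form, lemma, upos))

print(mwt_ranges)          # {'am': (5, 6)}
print(words[4], words[5])  # the two words that "am" expands to
```

The point is simply that the "5-6" line never takes part in the dependency tree; only the integer-id words do.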
The thing with the CoreNLP representation is, am is not a copular verb as far as I know. Google Translate says it means "at the". Also, it's completely missing that liegt is the verb. Basically that representation sucks.
The problem is that am in fact represents two words at the same time - the adposition and the determiner. If you just implement one tag for the entire token, probably the adposition, leaving out the determiner, that would be a little weird. Even more awkward would be a combination tag of some kind (although to be fair some datasets have adopted that approach, such as the Korean UD treebanks).
The solution UD adopted for most languages is to represent the text as a single token, am in this case, and split the analysis into the two words, an and dem. It is true there are some inconveniences here as well, such as an not corresponding to an actual start & end character. However, it makes analysis of words such as am much easier, since now you can analyze both words it represents in a proper manner.
This happens in other languages. In Spanish, the pronoun clitics get split from verbs - otherwise you'd have 10x as many verbs to analyze. In English, the entire class of possessives gets split, along with standard contractions such as can't, won't, it's, and colloquial contractions such as cannot, gonna, wanna. Then at the edges you can have 20-response-long threads on UD about kinda or mighta as possible additions to the splittable lexicon... (These kinds of threads alternate between amusing me every time I kick one off and discouraging me from asking in the first place about the best way for our software to analyze specific text.)
Long story short, if all you want is the analysis of the pieces, you can either filter out from the json / dict representation any token whose id isn't just an int, or you can iterate over doc.sentences[idx].words instead of using the dict representation. That might be a little unsatisfying since it won't have character offsets in a language such as German, where the MWTs don't split into easily understood pieces (compare to English, where we split cannot -> can not... how would you split am as text?). The Word objects each have a pointer to the enclosing Token, though, and the Token does have the start_char and end_char for the entire piece of text.
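On the dict side, that filter is short. A sketch assuming the to_dict()-style structure quoted later in this thread, where "id" is either a plain int (a word) or a two-element list (an MWT surface token); the sample values are copied from that output:

```python
# Keep only syntactic words from a json/dict-style token list: MWT
# surface tokens have a list id like [10, 11], real words an int id.
# Structure assumed from the to_dict() output shown in this thread.
tokens = [
    {"id": [10, 11], "text": "am", "start_char": 56, "end_char": 58},
    {"id": 10, "text": "an", "upos": "ADP", "deprel": "case"},
    {"id": 11, "text": "dem", "upos": "DET", "deprel": "det"},
]

words_only = [t for t in tokens if isinstance(t["id"], int)]
print([t["text"] for t in words_only])   # ['an', 'dem']
```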
As for spacy, it does
>>> doc = nlp("I don't know what spacy does with MWT")
>>> for token in doc:
... print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
do AUX aux
n't PART neg
know VERB ROOT
what PRON det
spacy NOUN nsubj
does VERB ccomp
with ADP prep
MWT PROPN pobj
>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
... print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna VERB ROOT
lick PROPN compound
Jennifer PROPN poss
's PART case
antennae NOUN dobj
>>> nlp = spacy.load('en_core_web_trf')
>>> doc = nlp("I wanna lick Jennifer's antennae")
>>> for token in doc:
... print(token.text, token.pos_, token.dep_)
...
I PRON nsubj
wanna AUX aux
lick VERB ROOT
Jennifer PROPN poss
's PART case
antennae NOUN dobj
>>> nlp = spacy.load('de_dep_news_trf')
>>> doc = nlp("Der Firma liegt genau am Ortseingang.")
>>> for token in doc:
... print(token.text, token.pos_, token.dep_)
...
Der DET nk
Firma NOUN da
liegt VERB ROOT
genau ADV mo
am ADP mo
Ortseingang NOUN nk
. PUNCT punct
So they are treating contractions as single words (although they do split clitics). IDK, maybe people prefer that representation.
Thx for taking the time to provide such an elaborate answer.
Many top tech companies are using Stanza and CoreNLP. I saw the same mistake and I am here to give feedback.
German is no doubt a very challenging language.
I am here to learn and to give feedback :-)
[Text=am CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=VBP Lemma=be NamedEntityTag=O]
==> VBP is unfortunately not correct.
The lemma of “am” would be “an” and "dem"
Correct!
5-6 am _ _ _ _ _ _ _ _
5 an an ADP APPR _ 7 case _ _
6 dem der DET ART Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _
an => ADP (preposition), dem => DET (determiner)
"an dem" => "an" is a preposition and "dem" is a definite article in the dative.
Therefore ADP (preposition) is correct for "an" within "am".
Thx for the tips on how to parse. Really helpful.
Ha ha, I accidentally used the English CoreNLP pipeline instead of the German one. Let me revise...
NLP> Der Firma liegt genau am Ortseingang.
Sentence #1 (8 tokens):
Der Firma liegt genau am Ortseingang.
Tokens:
[Text=Der CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DET NamedEntityTag=O]
[Text=Firma CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=liegt CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VERB NamedEntityTag=O]
[Text=genau CharacterOffsetBegin=16 CharacterOffsetEnd=21 PartOfSpeech=ADV NamedEntityTag=O]
[Text=an CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=ADP NamedEntityTag=O]
[Text=dem CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=DET NamedEntityTag=O]
[Text=Ortseingang CharacterOffsetBegin=25 CharacterOffsetEnd=36 PartOfSpeech=NOUN NamedEntityTag=O]
[Text=. CharacterOffsetBegin=36 CharacterOffsetEnd=37 PartOfSpeech=PUNCT NamedEntityTag=O]
Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, liegt-3)
det(Firma-2, Der-1)
nsubj(liegt-3, Firma-2)
advmod(Ortseingang-7, genau-4)
case(Ortseingang-7, an-5)
det(Ortseingang-7, dem-6)
obl:an(liegt-3, Ortseingang-7)
punct(liegt-3, .-8)
Okay, that's much better. It also splits am, then labels the start and end characters as the same (overlapping) text positions as the original word. So effectively it's the same design choice as made in Stanza, but without an explicit marker that it was a multi-word token.
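If you ever need to recover the grouping from output like that, the overlapping offsets are themselves a usable signal. A sketch in plain Python, with the (begin, end) offsets copied from the parse above; this is my own recovery heuristic, not anything CoreNLP provides:

```python
from itertools import groupby

# CoreNLP's German pipeline gives the split words "an" and "dem" the
# same character span (22, 24).  Grouping consecutive tokens by their
# (begin, end) offsets recovers the implicit multi-word token.
tokens = [
    ("Der", 0, 3), ("Firma", 4, 9), ("liegt", 10, 15), ("genau", 16, 21),
    ("an", 22, 24), ("dem", 22, 24), ("Ortseingang", 25, 36), (".", 36, 37),
]

groups = [list(g) for _, g in groupby(tokens, key=lambda t: (t[1], t[2]))]
mwts = [g for g in groups if len(g) > 1]   # groups sharing one span
print(mwts)   # [[('an', 22, 24), ('dem', 22, 24)]]
```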
I have yet to appreciate the benefits of treating it as a multi-word token.
So far, I only know a very limited set of languages.
I could be doing it wrong.
I doubt I get start_char and end_char for "an" and "dem"
for word in doc.sentences[idx].words:
    ...
{
"id": [
10,
11
],
"text": "am",
"start_char": 56,
"end_char": 58,
"ner": "O",
"multi_ner": [
"O"
]
},
{
"id": 10,
"text": "an",
"lemma": "an",
"upos": "ADP",
"xpos": "APPR",
"head": 12,
"deprel": "case"
},
{
"id": 11,
"text": "dem",
"lemma": "der",
"upos": "DET",
"xpos": "ART",
"feats": "Case=Dat|Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
"head": 12,
"deprel": "det"
}
True true. But what you can do is
>>> doc = pipe("Der Firma liegt genau am Ortseingang.")
>>> doc.sentences[0].words[4]
{
"id": 5,
"text": "an"
}
>>> doc.sentences[0].words[4].parent
[
{
"id": [
5,
6
],
"text": "am",
"start_char": 22,
"end_char": 24
},
{
"id": 5,
"text": "an"
},
{
"id": 6,
"text": "dem"
}
]
>>> doc.sentences[0].words[4].parent.start_char
22
>>> doc.sentences[0].words[4].parent.end_char
24
valuable tip!!!
What could cause the wrong POS of "miaut" in "Der Hund bellt, die Katze miaut."?
"miaut" is not a verb in stanza. I am curious how this could happen.
It is tagged as a verb if you use the default_accurate models. That package has the more accurate constituency parser anyway, so I would suggest using it if accurate constituency parses are desired.
As for the root cause, that word doesn't show up in the training data, so all the model has to go on are the embeddings and the context of the sentence. Sometimes it will get such a thing wrong.
What I have learned over the last few weeks is that one may need to go deeper into the source and into how the training is done. Each approach seems to have more success with some cases, while another does better with others. I see in many ways the merits of how Stanza is approaching the subject.
Am
“Am” is a contraction of “an” and “dem”.
An dem
“An dem” is used when you want to keep “an” and “dem” separate for emphasis or clarity.
How does Stanza handle them?
One word "am" with the right word id has TWO more additional words: "an dem"
It is simpler to just parse an int coming back from a word.id. Now, instead of int, it is an array referencing the TWO additional words
The challenges: The parent word has start_char and end_char, but the other morphological features are now transferred to the child word e.g. dem
Question
I wonder how best to handle this when parsing.
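One way to handle it is a small post-processing pass that copies the parent token's offsets onto its child words. A sketch over the dict structure shown earlier in this thread; attach_offsets is a made-up helper name, and the sample values are abbreviated:

```python
def attach_offsets(tokens):
    """Copy start_char/end_char from MWT surface tokens (list ids)
    onto the child words they cover (int ids), and drop the surface
    tokens.  Sketch over the dict structure quoted in this thread."""
    spans = {}  # child word id -> (start_char, end_char)
    for t in tokens:
        if isinstance(t["id"], list):
            for child_id in range(t["id"][0], t["id"][-1] + 1):
                spans[child_id] = (t["start_char"], t["end_char"])
    out = []
    for t in tokens:
        if isinstance(t["id"], int):
            t = dict(t)  # don't mutate the input
            if t["id"] in spans and "start_char" not in t:
                t["start_char"], t["end_char"] = spans[t["id"]]
            out.append(t)
    return out

tokens = [
    {"id": [10, 11], "text": "am", "start_char": 56, "end_char": 58},
    {"id": 10, "text": "an", "upos": "ADP"},
    {"id": 11, "text": "dem", "upos": "DET", "feats": "Case=Dat"},
]
for w in attach_offsets(tokens):
    print(w["text"], w["start_char"], w["end_char"])
# an 56 58
# dem 56 58
```

The trade-off is that both children get the same (whole-token) span, which matches what the Token itself reports via start_char/end_char.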