fatihbozdag opened this issue 2 years ago
Update: on the CoreNLP demo page, I got the correct results. However, the result is still inaccurate with the CoreNLP Python wrapper 'Stanza'.
CoreNLP demo page: [screenshot of the correct parse, not preserved]
Stanza:
import stanza

nlp = stanza.Pipeline('en')
# the input sentence is not shown in the original post; "What she said makes
# sense" is inferred from the output below
doc = nlp('What she said makes sense')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos, word.deprel)
What what PRON nsubj
she she PRON nsubj
said say VERB acl:relcl
makes make VERB root
sense sense NOUN obj
It actually does not appear they used stanza. They used CoreNLP, which has a separate dependency parser. The thing you are referring to as a "wrapper" is actually Stanza's dependency model. In general it's more accurate than CoreNLP, but clearly in this case it is not.
https://stanfordnlp.github.io/stanza/corenlp_client.html
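For reference, here is a minimal sketch of getting CoreNLP's dependency parse through that client (this assumes CoreNLP is installed locally with CORENLP_HOME pointing at it):

from stanza.server import CoreNLPClient

# a minimal sketch, assuming a local CoreNLP install with CORENLP_HOME set;
# this runs CoreNLP's own parser rather than Stanza's neural depparse
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'depparse'],
                   timeout=30000, memory='6G') as client:
    ann = client.annotate('What she said makes sense')
    sentence = ann.sentence[0]
    tokens = sentence.token
    for edge in sentence.basicDependencies.edge:
        print(edge.dep, tokens[edge.source - 1].word, tokens[edge.target - 1].word)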
Another option is to give us a few examples of the parser getting this wrong. We can add them to the training data and hopefully get a better result. I'd want to run them by my PI to double-check, though, which might mean a week or two of waiting, as he is currently away.
Sure, I will try different sentences. How should I send you the results? Should I collect them in a text file?
So far, I've tried more than 30 sentences. There are problematic ones (tagged "deprel": "acl:relcl").
import stanza
nlp = stanza.Pipeline('en')
2022-09-07 02:33:31 INFO: Loading these models for language: en (English):
============================
| Processor | Package |
----------------------------
| tokenize | combined |
| pos | combined |
| lemma | combined |
| depparse | combined |
| sentiment | sstplus |
| constituency | wsj |
| ner | ontonotes |
============================
2022-09-07 02:33:31 INFO: Use device: cpu
2022-09-07 02:33:31 INFO: Loading: tokenize
2022-09-07 02:33:31 INFO: Loading: pos
2022-09-07 02:33:32 INFO: Loading: lemma
2022-09-07 02:33:32 INFO: Loading: depparse
2022-09-07 02:33:32 INFO: Loading: sentiment
2022-09-07 02:33:32 INFO: Loading: constituency
2022-09-07 02:33:33 INFO: Loading: ner
2022-09-07 02:33:33 INFO: Done loading processors!
inaccurate_sents = 'Whatever she says makes sense. The thing that she did was well received. The news that she is quitting her job was shared in no time. Drunkenness reveals what soberness conceals.'
doc = nlp(inaccurate_sents)
for sentence in doc.sentences:
    for word in sentence.words:
        # word.pretty_print is printed without being called, which is why the
        # output below shows "<bound method Word.pretty_print of ...>" wrappers
        print(word.text, word.lemma, word.upos, word.deprel, word.pretty_print)
Whatever whatever PRON nsubj <bound method Word.pretty_print of {
"id": 1,
"text": "Whatever",
"lemma": "whatever",
"upos": "PRON",
"xpos": "WP",
"feats": "PronType=Int",
"head": 4,
"deprel": "nsubj",
"start_char": 0,
"end_char": 8
}>
she she PRON nsubj <bound method Word.pretty_print of {
"id": 2,
"text": "she",
"lemma": "she",
"upos": "PRON",
"xpos": "PRP",
"feats": "Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs",
"head": 3,
"deprel": "nsubj",
"start_char": 9,
"end_char": 12
}>
says say VERB acl:relcl <bound method Word.pretty_print of {
"id": 3,
"text": "says",
"lemma": "say",
"upos": "VERB",
"xpos": "VBZ",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
"head": 1,
"deprel": "acl:relcl",
"start_char": 13,
"end_char": 17
}>
makes make VERB root <bound method Word.pretty_print of {
"id": 4,
"text": "makes",
"lemma": "make",
"upos": "VERB",
"xpos": "VBZ",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 18,
"end_char": 23
}>
sense sense NOUN obj <bound method Word.pretty_print of {
"id": 5,
"text": "sense",
"lemma": "sense",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 4,
"deprel": "obj",
"start_char": 24,
"end_char": 29
}>
. . PUNCT punct <bound method Word.pretty_print of {
"id": 6,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": ".",
"head": 4,
"deprel": "punct",
"start_char": 29,
"end_char": 30
}>
The the DET det <bound method Word.pretty_print of {
"id": 1,
"text": "The",
"lemma": "the",
"upos": "DET",
"xpos": "DT",
"feats": "Definite=Def|PronType=Art",
"head": 2,
"deprel": "det",
"start_char": 31,
"end_char": 34
}>
thing thing NOUN nsubj:pass <bound method Word.pretty_print of {
"id": 2,
"text": "thing",
"lemma": "thing",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 8,
"deprel": "nsubj:pass",
"start_char": 35,
"end_char": 40
}>
that that PRON obj <bound method Word.pretty_print of {
"id": 3,
"text": "that",
"lemma": "that",
"upos": "PRON",
"xpos": "WDT",
"feats": "PronType=Rel",
"head": 5,
"deprel": "obj",
"start_char": 41,
"end_char": 45
}>
she she PRON nsubj <bound method Word.pretty_print of {
"id": 4,
"text": "she",
"lemma": "she",
"upos": "PRON",
"xpos": "PRP",
"feats": "Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs",
"head": 5,
"deprel": "nsubj",
"start_char": 46,
"end_char": 49
}>
did do VERB acl:relcl <bound method Word.pretty_print of {
"id": 5,
"text": "did",
"lemma": "do",
"upos": "VERB",
"xpos": "VBD",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
"head": 2,
"deprel": "acl:relcl",
"start_char": 50,
"end_char": 53
}>
was be AUX aux:pass <bound method Word.pretty_print of {
"id": 6,
"text": "was",
"lemma": "be",
"upos": "AUX",
"xpos": "VBD",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
"head": 8,
"deprel": "aux:pass",
"start_char": 54,
"end_char": 57
}>
well well ADV advmod <bound method Word.pretty_print of {
"id": 7,
"text": "well",
"lemma": "well",
"upos": "ADV",
"xpos": "RB",
"feats": "Degree=Pos",
"head": 8,
"deprel": "advmod",
"start_char": 58,
"end_char": 62
}>
received receive VERB root <bound method Word.pretty_print of {
"id": 8,
"text": "received",
"lemma": "receive",
"upos": "VERB",
"xpos": "VBN",
"feats": "Tense=Past|VerbForm=Part|Voice=Pass",
"head": 0,
"deprel": "root",
"start_char": 63,
"end_char": 71
}>
. . PUNCT punct <bound method Word.pretty_print of {
"id": 9,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": ".",
"head": 8,
"deprel": "punct",
"start_char": 71,
"end_char": 72
}>
The the DET det <bound method Word.pretty_print of {
"id": 1,
"text": "The",
"lemma": "the",
"upos": "DET",
"xpos": "DT",
"feats": "Definite=Def|PronType=Art",
"head": 2,
"deprel": "det",
"start_char": 73,
"end_char": 76
}>
news news NOUN nsubj:pass <bound method Word.pretty_print of {
"id": 2,
"text": "news",
"lemma": "news",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 10,
"deprel": "nsubj:pass",
"start_char": 77,
"end_char": 81
}>
that that PRON obj <bound method Word.pretty_print of {
"id": 3,
"text": "that",
"lemma": "that",
"upos": "PRON",
"xpos": "WDT",
"feats": "PronType=Rel",
"head": 6,
"deprel": "obj",
"start_char": 82,
"end_char": 86
}>
she she PRON nsubj <bound method Word.pretty_print of {
"id": 4,
"text": "she",
"lemma": "she",
"upos": "PRON",
"xpos": "PRP",
"feats": "Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs",
"head": 6,
"deprel": "nsubj",
"start_char": 87,
"end_char": 90
}>
is be AUX aux <bound method Word.pretty_print of {
"id": 5,
"text": "is",
"lemma": "be",
"upos": "AUX",
"xpos": "VBZ",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
"head": 6,
"deprel": "aux",
"start_char": 91,
"end_char": 93
}>
quitting quit VERB acl:relcl <bound method Word.pretty_print of {
"id": 6,
"text": "quitting",
"lemma": "quit",
"upos": "VERB",
"xpos": "VBG",
"feats": "Tense=Pres|VerbForm=Part",
"head": 2,
"deprel": "acl:relcl",
"start_char": 94,
"end_char": 102
}>
her she PRON nmod:poss <bound method Word.pretty_print of {
"id": 7,
"text": "her",
"lemma": "she",
"upos": "PRON",
"xpos": "PRP$",
"feats": "Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
"head": 8,
"deprel": "nmod:poss",
"start_char": 103,
"end_char": 106
}>
job job NOUN obj <bound method Word.pretty_print of {
"id": 8,
"text": "job",
"lemma": "job",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 6,
"deprel": "obj",
"start_char": 107,
"end_char": 110
}>
was be AUX aux:pass <bound method Word.pretty_print of {
"id": 9,
"text": "was",
"lemma": "be",
"upos": "AUX",
"xpos": "VBD",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
"head": 10,
"deprel": "aux:pass",
"start_char": 111,
"end_char": 114
}>
shared share VERB root <bound method Word.pretty_print of {
"id": 10,
"text": "shared",
"lemma": "share",
"upos": "VERB",
"xpos": "VBN",
"feats": "Tense=Past|VerbForm=Part|Voice=Pass",
"head": 0,
"deprel": "root",
"start_char": 115,
"end_char": 121
}>
in in ADP case <bound method Word.pretty_print of {
"id": 11,
"text": "in",
"lemma": "in",
"upos": "ADP",
"xpos": "IN",
"head": 13,
"deprel": "case",
"start_char": 122,
"end_char": 124
}>
no no DET det <bound method Word.pretty_print of {
"id": 12,
"text": "no",
"lemma": "no",
"upos": "DET",
"xpos": "DT",
"head": 13,
"deprel": "det",
"start_char": 125,
"end_char": 127
}>
time time NOUN obl <bound method Word.pretty_print of {
"id": 13,
"text": "time",
"lemma": "time",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 10,
"deprel": "obl",
"start_char": 128,
"end_char": 132
}>
. . PUNCT punct <bound method Word.pretty_print of {
"id": 14,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": ".",
"head": 10,
"deprel": "punct",
"start_char": 132,
"end_char": 133
}>
Drunkenness drunkenness NOUN nsubj <bound method Word.pretty_print of {
"id": 1,
"text": "Drunkenness",
"lemma": "drunkenness",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 2,
"deprel": "nsubj",
"start_char": 134,
"end_char": 145
}>
reveals reveal VERB root <bound method Word.pretty_print of {
"id": 2,
"text": "reveals",
"lemma": "reveal",
"upos": "VERB",
"xpos": "VBZ",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
"head": 0,
"deprel": "root",
"start_char": 146,
"end_char": 153
}>
what what PRON obj <bound method Word.pretty_print of {
"id": 3,
"text": "what",
"lemma": "what",
"upos": "PRON",
"xpos": "WP",
"feats": "PronType=Int",
"head": 2,
"deprel": "obj",
"start_char": 154,
"end_char": 158
}>
soberness soberness NOUN nsubj <bound method Word.pretty_print of {
"id": 4,
"text": "soberness",
"lemma": "soberness",
"upos": "NOUN",
"xpos": "NN",
"feats": "Number=Sing",
"head": 5,
"deprel": "nsubj",
"start_char": 159,
"end_char": 168
}>
conceals conceal VERB acl:relcl <bound method Word.pretty_print of {
"id": 5,
"text": "conceals",
"lemma": "conceal",
"upos": "VERB",
"xpos": "VBZ",
"feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
"head": 3,
"deprel": "acl:relcl",
"start_char": 169,
"end_char": 177
}>
. . PUNCT punct <bound method Word.pretty_print of {
"id": 6,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": ".",
"head": 2,
"deprel": "punct",
"start_char": 177,
"end_char": 178
}>
That's a good start, but I think we probably need more than 4 to move the needle much.
Alright then, I will update this post as I try more sentences.
I guess the problem is mostly related to two distinct patterns, for instance 'what + dependent clause, independent clause' and 'nsubj + that + clause, independent clause'.
Another inaccurate one. This one is actually a sample sentence (page 3) from the Stanford typed dependencies manual, where it is exemplified as: “That she lied was suspected by everyone” with csubjpass(suspected, lied).
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 154kB [00:00, 9.85MB/s]
2022-09-08 15:12:52 INFO: Loading these models for language: en (English):
============================
| Processor | Package |
----------------------------
| tokenize | combined |
| pos | combined |
| lemma | combined |
| depparse | combined |
| sentiment | sstplus |
| constituency | wsj |
| ner | ontonotes |
============================
2022-09-08 15:12:52 INFO: Use device: cpu
2022-09-08 15:12:52 INFO: Loading: tokenize
2022-09-08 15:12:52 INFO: Loading: pos
2022-09-08 15:12:52 INFO: Loading: lemma
2022-09-08 15:12:52 INFO: Loading: depparse
2022-09-08 15:12:53 INFO: Loading: sentiment
2022-09-08 15:12:53 INFO: Loading: constituency
2022-09-08 15:12:53 INFO: Loading: ner
2022-09-08 15:12:54 INFO: Done loading processors!
# token.dep_ is the spaCy attribute (pipeline loaded via spacy-stanza)
doc = nlp('That she lied was suspected by everyone')
for token in doc:
    print(token.text, token.dep_)
That mark
she nsubj
lied nsubj:pass
was aux:pass
suspected root
by case
everyone obl:agent
It would be better, I guess, to parse a whole corpus and extract problematic sentences. However, I was unable to find a dataset with complement clauses.
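One rough shape that corpus sweep could take, sketched here under the assumption of a plain-text corpus with one sentence per line ('corpus.txt' is just an illustrative name):

import stanza

# collect sentences containing acl:relcl for manual inspection; whether each
# one should really have been csubj still has to be judged by hand
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        doc = nlp(line.strip())
        for sentence in doc.sentences:
            if any(word.deprel == 'acl:relcl' for word in sentence.words):
                print(sentence.text)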
Circling back to this: I don't think (although I'm not a UD expert) that these should be csubj:
The thing that she did was well received. The news that she is quitting her job was shared in no time.
'The thing that she did' and 'The news...' are both concrete things.
So is it a documentation issue, then? As in here: https://universaldependencies.org/u/dep/csubj.html. If not csubj, then what should be the correct tag?
No, I think those examples are correct. The difference is that 'The thing' and 'The news' are objects, whereas 'What she did' is a clause. Although, as I said earlier, I might be wrong here, as I'm not an expert.
So, for example, the one rule we have for detecting csubj in a constituency tree is (this is in our tree grep format):
"S < (SBAR|S=target !$+ /^,$/ $++ (VP !$-- NP))");
which fits the distinction in parse trees for the following two sentences:
>>> print("{:P}".format(pipe("What she said makes sense").sentences[0].constituency))
(ROOT
(S
(SBAR
(WHNP (WP What))
(S
(NP (PRP she))
(VP (VBD said))))
(VP
(VBZ makes)
(NP (NN sense)))))
>>> print("{:P}".format(pipe("That makes sense").sentences[0].constituency))
(ROOT
(S
(NP (DT That))
(VP
(VBZ makes)
(NP (NN sense)))))
So I do believe, with reasonably high confidence, that 'What she did...' results in a csubj and 'The thing...' does not.
For reference, csubjpass is converted from a constituency tree to dependencies with the following two rules:
"S < (SBAR|S=target !$+ /^,$/ $++ (VP < (VP < VBN|VBD) < (/^(?:VB|AUXG?)/ < " + passiveAuxWordRegex + ") !$-- NP))",
"S < (SBAR|S=target !$+ /^,$/ $++ (VP <+(VP) (VP < VBN|VBD > (VP < (/^(?:VB|AUX)/ < " + passiveAuxWordRegex + "))) !$-- NP))");
I wrote up a little thing which checks how well the depparse does for each dependency. As it turns out, it gets an F1 of 0.6792 on csubj, which is one of the worst individual F1 scores. Perhaps not surprising considering there are only 362 in the EWT training data, vs 19516 nsubj, for example.
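That script isn't shown here, but a rough sketch of such a per-dependency check over parallel gold and predicted CoNLL-U files might look like this (assuming identical tokenization; the file names are illustrative):

def read_rows(path):
    # keep (id, head, deprel) for each syntactic word line; comments,
    # multi-word tokens, and empty nodes are skipped by the isdigit check
    rows = []
    for line in open(path, encoding='utf-8'):
        parts = line.strip().split('\t')
        if len(parts) == 10 and parts[0].isdigit():
            rows.append((parts[0], parts[6], parts[7]))
    return rows

def deprel_f1(gold, pred, label):
    # a token counts as correct only if head and deprel both match the gold
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g[2] == label)
    prec = tp / max(1, sum(1 for r in pred if r[2] == label))
    rec = tp / max(1, sum(1 for r in gold if r[2] == label))
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(deprel_f1(read_rows('en_ewt-ud-test.conllu'), read_rows('pred.conllu'), 'csubj'))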
The problem with this situation is that 362 examples of csubj is already enough that adding enough new ones to significantly change the balance of csubj compared to acl:relcl would be rather tedious. I can see why the constituency parser would do a better job of detecting it: the SBAR / VP pattern is pretty easy to build correctly, and then the conversion to dependencies is deterministic.
If this is a significant limitation (and you're still working on detecting csubj; sorry, it has been quite a while), we could make a Python interface to the CoreNLP dependency converter, which would presumably help detect csubj in English contexts as long as you use the constituency parser. If it was more a case of noticing that csubj is less accurate than the other dependencies, I would say you're right, but unfortunately there isn't a great solution for improving the csubj results.
I am still following the topic. Many interfaces, scripts, etc. rely heavily on the parser, and I do not think the studies published so far (in corpus linguistics) are aware of the issue. Such a Python interface would be very handy.
I added a tool which converts constituency trees to dependencies. It gets csubj correct for a couple of the examples you gave:
pipe = stanza.Pipeline("en", processors="tokenize,pos,constituency,depparse", depparse_with_converter=True)
doc = pipe("What she said makes sense")
print("{:C}".format(doc))
# text = What she said makes sense
# sent_id = 0
# constituency = (ROOT (S (SBAR (WHNP (WP What)) (S (NP (PRP she)) (VP (VBD said)))) (VP (VBZ makes) (NP (NN sense)))))
1 What _ PRON WP PronType=Int 3 obj _ start_char=0|end_char=4
2 she _ PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 3 nsubj _ start_char=5|end_char=8
3 said _ VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 4 csubj _ start_char=9|end_char=13
4 makes _ VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ start_char=14|end_char=19
5 sense _ NOUN NN Number=Sing 4 obj _ start_char=20|end_char=25
It doesn't get the csubj:pass in 'That she lied was suspected by everyone', though, because the parse tree is wrong. There is a more accurate parser using Roberta which does get it right:
pipe = stanza.Pipeline("en", processors="tokenize,pos,constituency,depparse", package={"depparse": "converter", "constituency": "wsj_bert"})
doc = pipe("That she lied was suspected by everyone")
print("{:C}".format(doc))
# text = That she lied was suspected by everyone
# sent_id = 0
# constituency = (ROOT (S (SBAR (IN That) (S (NP (PRP she)) (VP (VBD lied)))) (VP (VBD was) (VP (VBN suspected) (PP (IN by) (NP (NN everyone)))))))
1 That _ SCONJ IN _ 3 mark _ start_char=0|end_char=4
2 she _ PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 3 nsubj _ start_char=5|end_char=8
3 lied _ VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 csubj:pass _ start_char=9|end_char=13
4 was _ AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 aux:pass _ start_char=14|end_char=17
5 suspected _ VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root _ start_char=18|end_char=27
6 by _ ADP IN _ 7 case _ start_char=28|end_char=30
7 everyone _ PRON NN Number=Sing|PronType=Tot 5 obl _ start_char=31|end_char=39
I haven't scored the overall accuracy of the Roberta parser's conversion to dependencies, but hopefully it's doing alright.
In order to use these things, you'll need to download & build the CoreNLP dev branch and the Stanza dev branch as well. I will try to make a new release of both by the end of the month.
... overall the results are not great, I have to say. LAS of 81.53 for the Roberta constituency parser, as opposed to 88.50 for the dedicated dependency parser. I would recommend just sticking with the current dependency parser, even though it isn't that accurate for csubj.
Greetings all,
I am working on extracting subordinate clauses via Stanza (through spacy-stanza); however, dependency parsing seems to provide inaccurate results. Following the guide from https://universaldependencies.org, clausal subjects are tagged as csubj. For instance, the expected result should be as follows: [expected parse, not preserved] However, this is the result I get: [actual parse, not preserved] Stanza tags the item 'said' as a relative clause. As explained in this paper, the authors also used Stanza, yet I am not sure whether it is a pretrained model or not. Why the inconsistency? I've also tried other packages such as 'ewt' and got similar results. I am having much the same issue with spaCy models as well. Training a model from scratch would be beyond my knowledge. How should I proceed?
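For completeness, a minimal sketch of the kind of extraction being attempted here via spacy-stanza (note that, per the discussion above, the default model may label 'said' as acl:relcl rather than the expected csubj):

import stanza
import spacy_stanza

# stanza.download fetches the English models on first use
stanza.download('en')
nlp = spacy_stanza.load_pipeline('en')
doc = nlp('What she said makes sense.')
for token in doc:
    print(token.text, token.dep_, '<-', token.head.text)

# clausal-subject tokens, if the parser produced any
csubjs = [t for t in doc if t.dep_ in ('csubj', 'csubj:pass')]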