stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Inaccurate Dependency Tagging for Subordinates (ccomp) #1114

Open fatihbozdag opened 2 years ago

fatihbozdag commented 2 years ago

Greetings all,

I am working on extracting subordinate clauses via Stanza (through spacy-stanza, to be precise); however, dependency parsing seems to produce inaccurate results.

Following the guidelines at https://universaldependencies.org, clausal subjects should be tagged as csubj. For instance, the expected results should be as follows:

import spacy_stanza

nlp = spacy_stanza.load_pipeline("en")
sentence = 'what she said makes sense'
doc = nlp(sentence)

for t in doc:
    print(t.text, t.dep_, t.head.text)

what dobj said
she nsubj said
said csubj was
was ROOT was
well advmod received
received acomp was

However, these are the results I actually get:

What obj makes
she nsubj said
said acl:relcl What
makes root makes
sense obj makes
. punct makes

Stanza tags the item 'said' as a relative clause. The authors of this paper also used Stanza, yet I am not sure whether they used a pretrained model or not. Why the inconsistency? I've also tried other model packages such as 'ewt' and got similar results, and I am having much the same issue with spaCy models as well. Training a model from scratch is beyond my knowledge. How should I proceed?

fatihbozdag commented 2 years ago

Update,

On the CoreNLP demo page, I get the correct results. However, the result is still inaccurate with the CoreNLP Python wrapper, Stanza.

CoreNLP demo page;

[screenshot of the CoreNLP demo page's dependency parse]

Stanza;

import stanza

nlp = stanza.Pipeline('en')
doc = nlp('What she said makes sense')

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos, word.deprel)

What what PRON nsubj
she she PRON nsubj
said say VERB acl:relcl
makes make VERB root
sense sense NOUN obj
AngledLuffa commented 2 years ago

It actually does not appear they used stanza.  They used CoreNLP, which has a separate dependency parser. The thing you are referring to as a "wrapper" is actually Stanza's dependency model. In general it's more accurate than CoreNLP, but clearly in this case it is not.

https://stanfordnlp.github.io/stanza/corenlp_client.html

Another option is to give us a few examples of the parser getting this wrong. We can add them to the training data and hopefully get a better result. I'd want to run them by my PI to double check, though, which might mean a week or two of waiting, as he is currently away.

fatihbozdag commented 2 years ago

Sure, I will try with different sentences. How should I send you the results? Collect them in a text file?

fatihbozdag commented 2 years ago

So far, I've tried more than 30 sentences. These are the problematic ones (all of them get "deprel": "acl:relcl"):

import stanza
nlp = stanza.Pipeline('en')

2022-09-07 02:33:31 INFO: Loading these models for language: en (English):
============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

2022-09-07 02:33:31 INFO: Use device: cpu
2022-09-07 02:33:31 INFO: Loading: tokenize
2022-09-07 02:33:31 INFO: Loading: pos
2022-09-07 02:33:32 INFO: Loading: lemma
2022-09-07 02:33:32 INFO: Loading: depparse
2022-09-07 02:33:32 INFO: Loading: sentiment
2022-09-07 02:33:32 INFO: Loading: constituency
2022-09-07 02:33:33 INFO: Loading: ner
2022-09-07 02:33:33 INFO: Done loading processors!

inaccurate_sents = 'Whatever she says makes sense. The thing that she did was well received. The news that she is quitting her job was shared in no time. Drunkenness reveals what soberness conceals.'
doc = nlp(inaccurate_sents)

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.id, word.text, word.lemma, word.upos, word.xpos, word.head, word.deprel)

# id  text  lemma  upos  xpos  head  deprel
# text = Whatever she says makes sense.
1   Whatever     whatever     PRON   WP    4   nsubj
2   she          she          PRON   PRP   3   nsubj
3   says         say          VERB   VBZ   1   acl:relcl
4   makes        make         VERB   VBZ   0   root
5   sense        sense        NOUN   NN    4   obj
6   .            .            PUNCT  .     4   punct

# text = The thing that she did was well received.
1   The          the          DET    DT    2   det
2   thing        thing        NOUN   NN    8   nsubj:pass
3   that         that         PRON   WDT   5   obj
4   she          she          PRON   PRP   5   nsubj
5   did          do           VERB   VBD   2   acl:relcl
6   was          be           AUX    VBD   8   aux:pass
7   well         well         ADV    RB    8   advmod
8   received     receive      VERB   VBN   0   root
9   .            .            PUNCT  .     8   punct

# text = The news that she is quitting her job was shared in no time.
1   The          the          DET    DT    2   det
2   news         news         NOUN   NN    10  nsubj:pass
3   that         that         PRON   WDT   6   obj
4   she          she          PRON   PRP   6   nsubj
5   is           be           AUX    VBZ   6   aux
6   quitting     quit         VERB   VBG   2   acl:relcl
7   her          she          PRON   PRP$  8   nmod:poss
8   job          job          NOUN   NN    6   obj
9   was          be           AUX    VBD   10  aux:pass
10  shared       share        VERB   VBN   0   root
11  in           in           ADP    IN    13  case
12  no           no           DET    DT    13  det
13  time         time         NOUN   NN    10  obl
14  .            .            PUNCT  .     10  punct

# text = Drunkenness reveals what soberness conceals.
1   Drunkenness  drunkenness  NOUN   NN    2   nsubj
2   reveals      reveal       VERB   VBZ   0   root
3   what         what         PRON   WP    2   obj
4   soberness    soberness    NOUN   NN    5   nsubj
5   conceals     conceal      VERB   VBZ   3   acl:relcl
6   .            .            PUNCT  .     2   punct
AngledLuffa commented 2 years ago

That's a good start, but I think we probably need more than 4 to move the needle much

fatihbozdag commented 2 years ago

Alright then, I will update this post as I try more sentences.

I guess the problem mostly comes down to two distinct patterns, for instance 'what + dependent clause, independent clause' and 'nsubj + that + clause, independent clause'.
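Once you have a parse, that first pattern can be hunted for mechanically. A minimal sketch (the `suspect_relcl` helper and its token-dict format are hypothetical illustrations, not part of Stanza's API) that flags verbs tagged acl:relcl whose head is a bare wh-pronoun, i.e. the configuration where csubj was expected:

```python
# Hypothetical helper: flag acl:relcl verbs whose head is a bare wh-pronoun,
# the configuration where the UD guidelines call for csubj instead.
WH_WORDS = {'what', 'whatever', 'who', 'whoever'}

def suspect_relcl(tokens):
    """tokens: list of dicts with 'id', 'text', 'head', 'deprel'
    (mirroring Stanza's per-word fields)."""
    by_id = {t['id']: t for t in tokens}
    hits = []
    for t in tokens:
        if t['deprel'] == 'acl:relcl':
            head = by_id.get(t['head'])
            if head is not None and head['text'].lower() in WH_WORDS:
                hits.append(t['text'])
    return hits
```

Run over the parse of 'Whatever she says makes sense', this would flag 'says'; it stays silent on 'The thing that she did...', where the acl:relcl head is an ordinary noun.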

fatihbozdag commented 2 years ago

Another inaccurate one. This one is actually a sample sentence (page 3) from the Stanford typed dependencies manual.

There it is exemplified as: "That she lied was suspected by everyone" with csubjpass(suspected, lied).


import spacy_stanza

nlp = spacy_stanza.load_pipeline("en")
doc = nlp('That she lied was suspected by everyone')

for token in doc:
    print(token.text, token.dep_)

That mark
she nsubj
lied nsubj:pass
was aux:pass
suspected root
by case
everyone obl:agent

It would be better, I guess, to parse a whole corpus and extract problematic sentences. However, I was unable to find a dataset with complement clauses.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 year ago

This issue has been automatically closed due to inactivity.

AngledLuffa commented 1 year ago

circling back to this, i don't think (although i'm not a UD expert) that these should be csubj:

The thing that she did was well received. The news that she is quitting her job was shared in no time.

The thing that she did and The news... are both concrete things

fatihbozdag commented 1 year ago

So is it a documentation issue then, as described here: https://universaldependencies.org/u/dep/csubj.html?

If not csubj, then what should be the correct tag?

AngledLuffa commented 1 year ago

No, I think those examples are correct. The difference is that The thing and The news are objects whereas What she did is a clause. Although as I said earlier, I might be wrong here, as I'm not an expert.

AngledLuffa commented 1 year ago

so for example, the one rule we have for detecting csubj in a constituency tree is (this is in our tree grep format):

            "S < (SBAR|S=target !$+ /^,$/ $++ (VP !$-- NP))");

which fits the distinction in parse trees for the following two sentences:

>>> print("{:P}".format(pipe("What she said makes sense").sentences[0].constituency))
(ROOT
  (S
    (SBAR
      (WHNP (WP What))
      (S
        (NP (PRP she))
        (VP (VBD said))))
    (VP
      (VBZ makes)
      (NP (NN sense)))))

>>> print("{:P}".format(pipe("That makes sense").sentences[0].constituency))
(ROOT
  (S
    (NP (DT That))
    (VP
      (VBZ makes)
      (NP (NN sense)))))

So I do believe with reasonably high confidence that What she did... results in a csubj and The thing... does not
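The gist of that tregex rule can be mimicked in plain Python: look for an S node whose pre-VP subject slot is filled by an SBAR rather than an NP. A rough sketch of the idea (not CoreNLP's actual tregex engine; it ignores the rule's comma condition and other subtleties):

```python
def parse_sexp(s):
    """Parse a bracketed constituency tree into (label, children) tuples;
    leaf words become (word, [])."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def helper(i):
        # tokens[i] is '('; tokens[i+1] is the node label
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = helper(i)
                children.append(child)
            else:
                children.append((tokens[i], []))
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def has_clausal_subject(tree):
    """True if some S node has an SBAR child before its VP with no NP in between,
    i.e. the clause itself occupies the subject slot."""
    label, children = tree
    if label == 'S':
        labels = [c[0] for c in children]
        if 'SBAR' in labels and 'VP' in labels:
            s_i, v_i = labels.index('SBAR'), labels.index('VP')
            if s_i < v_i and 'NP' not in labels[s_i:v_i]:
                return True
    return any(has_clausal_subject(c) for c in children)
```

On the two trees above, this answers True for the "What she said makes sense" tree and False for "That makes sense".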

for reference, csubjpass is converted from a constituency tree to dependencies with the following two rules:

            "S < (SBAR|S=target !$+ /^,$/ $++ (VP < (VP < VBN|VBD) < (/^(?:VB|AUXG?)/ < " + passiveAuxWordRegex + ") !$-- NP))",
            "S < (SBAR|S=target !$+ /^,$/ $++ (VP <+(VP) (VP < VBN|VBD > (VP < (/^(?:VB|AUX)/ < " + passiveAuxWordRegex + "))) !$-- NP))");
AngledLuffa commented 1 year ago

I wrote up a little thing which checks how well the depparse does for each dependency. As it turns out, it gets an F1 of 0.6792 on csubj, which is one of the worst individual F1 scores. Perhaps not surprising considering there are only 362 in the EWT training data, vs 19516 nsubj, for example.
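For readers wondering what a per-dependency F1 means here: a token counts as a true positive for a label only when both its head and its relation match the gold parse. A minimal sketch of such a scorer (an illustration, not Stanza's actual evaluation code):

```python
from collections import Counter

def per_label_f1(gold, pred):
    """gold, pred: parallel lists of (head, deprel) pairs, one per token.
    Returns a dict mapping each deprel label to its F1 score."""
    tp, gold_n, pred_n = Counter(), Counter(), Counter()
    for (gh, gl), (ph, pl) in zip(gold, pred):
        gold_n[gl] += 1
        pred_n[pl] += 1
        if gh == ph and gl == pl:
            tp[gl] += 1
    scores = {}
    for label in set(gold_n) | set(pred_n):
        p = tp[label] / pred_n[label] if pred_n[label] else 0.0
        r = tp[label] / gold_n[label] if gold_n[label] else 0.0
        scores[label] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```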

The problem with this situation is that 362 examples of csubj is already enough that adding more to significantly change the balance of csubj compared to acl:relcl would be rather tedious. I can see why the constituency parser would do a better job of detecting it - the SBAR / VP pattern is pretty easy to build correctly, and then the conversion to dependencies is deterministic.

If this is a significant limitation (and you're still working on detecting csubj, sorry, it was quite a while ago) we could make a python interface to the CoreNLP dependency converter, which would presumably help detect csubj in English contexts as long as you use the constituency parser. If it was more of a case of noticing that csubj is less accurate than the other dependencies, I would say you're right, but unfortunately there isn't a great solution for improving the csubj results.

fatihbozdag commented 1 year ago

I am still following the topic. Many interfaces/scripts etc. rely heavily on the parser, and I do not think the studies published so far (in corpus linguistics) are aware of the issue. Such a Python interface would be very handy.

AngledLuffa commented 1 year ago

I added a tool which converts constituency trees to dependencies. It gets csubj correct for a couple of the examples you gave:

pipe = stanza.Pipeline("en", processors="tokenize,pos,constituency,depparse", depparse_with_converter=True)
doc = pipe("What she said makes sense")
print("{:C}".format(doc))
# text = What she said makes sense
# sent_id = 0
# constituency = (ROOT (S (SBAR (WHNP (WP What)) (S (NP (PRP she)) (VP (VBD said)))) (VP (VBZ makes) (NP (NN sense)))))
1       What    _       PRON    WP      PronType=Int    3       obj     _       start_char=0|end_char=4
2       she     _       PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   3       nsubj   _       start_char=5|end_char=8
3       said    _       VERB    VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   4       csubj   _       start_char=9|end_char=13
4       makes   _       VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       start_char=14|end_char=19
5       sense   _       NOUN    NN      Number=Sing     4       obj     _       start_char=20|end_char=25

It doesn't get the csubj:pass in That she lied was suspected by everyone, because the parse tree is wrong. There's a more accurate parser using Roberta, though, which does get it right:

pipe = stanza.Pipeline("en", processors="tokenize,pos,constituency,depparse", package={"depparse": "converter", "constituency": "wsj_bert"})
doc = pipe("That she lied was suspected by everyone")
print("{:C}".format(doc))
# text = That she lied was suspected by everyone
# sent_id = 0
# constituency = (ROOT (S (SBAR (IN That) (S (NP (PRP she)) (VP (VBD lied)))) (VP (VBD was) (VP (VBN suspected) (PP (IN by) (NP (NN everyone)))))))
1       That    _       SCONJ   IN      _       3       mark    _       start_char=0|end_char=4
2       she     _       PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   3       nsubj   _       start_char=5|end_char=8
3       lied    _       VERB    VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   5       csubj:pass      _       start_char=9|end_char=13
4       was     _       AUX     VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   5       aux:pass        _       start_char=14|end_char=17
5       suspected       _       VERB    VBN     Tense=Past|VerbForm=Part|Voice=Pass     0       root    _       start_char=18|end_char=27
6       by      _       ADP     IN      _       7       case    _       start_char=28|end_char=30
7       everyone        _       PRON    NN      Number=Sing|PronType=Tot        5       obl     _       start_char=31|end_char=39

I haven't scored the overall accuracy of the Roberta parser converter to dependencies, but hopefully it's doing alright.

In order to use these things, you'll need to download & build the CoreNLP dev branch and the Stanza dev branch as well. I will try to make a new release of both by the end of the month.

AngledLuffa commented 1 year ago

... overall the results are not great, I have to say. LAS of 81.53 for the Roberta constituency parser, as opposed to 88.50 for the dedicated dependency parser. I would recommend just sticking with the current dependency parser, even though it isn't that accurate for csubj.
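For reference, LAS (labeled attachment score) is the percentage of tokens whose predicted head and relation both match the gold annotation; a minimal sketch (not Stanza's scorer):

```python
def las(gold, pred):
    """Labeled attachment score over parallel lists of (head, deprel) pairs,
    one pair per token: a token is correct only if head AND relation match."""
    assert len(gold) == len(pred) and gold
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)
```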

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.