stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Constituency parses added to incorrect sentences (English) #970

Closed nnkennard closed 2 years ago

nnkennard commented 2 years ago

Describe the bug Running the constituency parser on a doc with n sentences can attach parses to the wrong Sentence objects: the ith Sentence (counting from 0) gets the constituency parse of the (n-i-1)th sentence, so the parses come back in reverse order.

To Reproduce Steps to reproduce the behavior:

  1. Run stanza.download('en')
  2. Run this code (full code in this gist):
import stanza
text = ...  # placeholder: the long text is in the gist
STANZA_PIPELINE = stanza.Pipeline('en',
                                  processors='tokenize,lemma,pos,constituency',
                                  tokenize_pretokenized=True,
                                  tokenize_no_ssplit=True)
doc = STANZA_PIPELINE(text)
for sentence in doc.sentences:
    print(" ".join([token.text for token in sentence.tokens]))
    print(sentence.constituency)
    print()
  3. Output:
CHAPTER I TREATS OF THE PLACE...
(ROOT (S (SBAR (IN Although) (S (NP (PRP I)) (VP (VBP am) (RB not) (VP (VBN disposed)...

For a long time after it was ushered into this world of sorrow and trouble...
(ROOT (S (S (PP (IN For) (NP (NP (DT a) (JJ long) (NN time)) (SBAR (IN after) (S (NP (PRP it)) (VP (VBD was) (VP (VBN ushered) (PP (IN into) (NP (NP (DT this) (NN world)) (PP (IN of) (NP (NN sorrow) (CC and) (NN trouble)))))))))))...

Although I am not disposed...
(ROOT (S (S (NP (NP (NN CHAPTER)) (SBAR (S (NP (PRP I)) (VP (VBZ TREATS) (PP (IN OF) (NP (NP (DT THE) (NN PLACE))...

Expected behavior The parses should attach to the correct sentences; output should look like:

CHAPTER I TREATS OF THE PLACE...
(ROOT (S (S (NP (NP (NN CHAPTER)) (SBAR (S (NP (PRP I)) (VP (VBZ TREATS) (PP (IN OF) (NP (NP (DT THE) (NN PLACE))...

For a long time after it was ushered into this world of sorrow and trouble...
(ROOT (S (S (PP (IN For) (NP (NP (DT a) (JJ long) (NN time)) (SBAR (IN after) (S (NP (PRP it)) (VP (VBD was) (VP (VBN ushered) (PP (IN into) (NP (NP (DT this) (NN world)) (PP (IN of) (NP (NN sorrow) (CC and) (NN trouble)))))))))))...

Although I am not disposed...
(ROOT (S (SBAR (IN Although) (S (NP (PRP I)) (VP (VBP am) (RB not) (VP (VBN disposed)...
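
One way to spot this kind of mismatch automatically is to compare each sentence's tokens against the leaf words of the parse attached to it. The following is only a rough sketch, not part of stanza: the helper names `leaves_from_bracketed` and `find_misaligned` are made up for illustration, the data is toy data, and it assumes the parse is available as a bracketed string like the ones printed above.

```python
import re

# Matches a preterminal such as "(NN time)" and captures the tag and the word.
_PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves_from_bracketed(tree_str):
    """Return the leaf words of a bracketed parse, left to right."""
    return [word for _tag, word in _PRETERMINAL.findall(tree_str)]


def find_misaligned(token_lists, tree_strs):
    """Return indices of sentences whose tree leaves differ from their tokens."""
    return [
        i
        for i, (tokens, tree_str) in enumerate(zip(token_lists, tree_strs))
        if leaves_from_bracketed(tree_str) != list(tokens)
    ]


# Toy data standing in for doc.sentences: the first and last parses are swapped.
token_lists = [["I", "am", "here"], ["Dogs", "bark"], ["It", "rains"]]
tree_strs = [
    "(ROOT (S (NP (PRP It)) (VP (VBZ rains))))",
    "(ROOT (S (NP (NNS Dogs)) (VP (VBP bark))))",
    "(ROOT (S (NP (PRP I)) (VP (VBP am) (ADVP (RB here)))))",
]
print(find_misaligned(token_lists, tree_strs))  # [0, 2]
```

With a real doc, the inputs could presumably be built as `[[t.text for t in s.tokens] for s in doc.sentences]` and `[str(s.constituency) for s in doc.sentences]`. Tokens that contain brackets may be escaped in the tree output, so an exact string comparison could report false mismatches for those sentences.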

AngledLuffa commented 2 years ago

https://github.com/stanfordnlp/stanza/issues/919

nnkennard commented 2 years ago

Thanks!