Describe the bug
Running the constituency parser on a doc with n sentences can result in the i-th Sentence object getting the constituency parse of the (n-i-1)-th sentence (zero-indexed), i.e. the parses come back attached in reverse order.
To Reproduce
Steps to reproduce the behavior:
import stanza

stanza.download('en')

text = "..."  # Long text is in the gist

STANZA_PIPELINE = stanza.Pipeline('en',
                                  processors='tokenize,lemma,pos,constituency',
                                  tokenize_pretokenized=True,
                                  tokenize_no_ssplit=True)
doc = STANZA_PIPELINE(text)
for sentence in doc.sentences:
    # Print each sentence's tokens followed by the parse attached to it
    print(" ".join([token.text for token in sentence.tokens]))
    print(sentence.constituency)
    print()
Output:
CHAPTER I TREATS OF THE PLACE...
(ROOT (S (SBAR (IN Although) (S (NP (PRP I)) (VP (VBP am) (RB not) (VP (VBN disposed)...
For a long time after it was ushered into this world of sorrow and trouble...
(ROOT (S (S (PP (IN For) (NP (NP (DT a) (JJ long) (NN time)) (SBAR (IN after) (S (NP (PRP it)) (VP (VBD was) (VP (VBN ushered) (PP (IN into) (NP (NP (DT this) (NN world)) (PP (IN of) (NP (NN sorrow) (CC and) (NN trouble)))))))))))...
Although I am not disposed...
(ROOT (S (S (NP (NP (NN CHAPTER)) (SBAR (S (NP (PRP I)) (VP (VBZ TREATS) (PP (IN OF) (NP (NP (DT THE) (NN PLACE))...
Expected behavior
The parses should attach to the correct sentences; output should look like:
CHAPTER I TREATS OF THE PLACE...
(ROOT (S (S (NP (NP (NN CHAPTER)) (SBAR (S (NP (PRP I)) (VP (VBZ TREATS) (PP (IN OF) (NP (NP (DT THE) (NN PLACE))...
For a long time after it was ushered into this world of sorrow and trouble...
(ROOT (S (S (PP (IN For) (NP (NP (DT a) (JJ long) (NN time)) (SBAR (IN after) (S (NP (PRP it)) (VP (VBD was) (VP (VBN ushered) (PP (IN into) (NP (NP (DT this) (NN world)) (PP (IN of) (NP (NN sorrow) (CC and) (NN trouble)))))))))))...
Although I am not disposed...
(ROOT (S (SBAR (IN Although) (S (NP (PRP I)) (VP (VBP am) (RB not) (VP (VBN disposed)...
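In case it helps with triage, the mismatch can also be checked mechanically instead of by eye. Below is a minimal sketch, not part of Stanza: `tree_leaves` and `matching_parse_index` are hypothetical helpers that read the leaf words out of a bracketed parse string (the format printed above) and report which parse actually belongs to each sentence. The toy sentences and parses are made up to mimic the symptom.

```python
import re

# Matches a preterminal such as "(NN time)" and captures the tag and the word.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tree_leaves(bracketed):
    """Return the words at the leaves of a bracketed parse string."""
    return [word for _tag, word in LEAF.findall(bracketed)]


def matching_parse_index(tokens, parses):
    """Return the index of the first parse whose leaves equal tokens, or None."""
    for index, parse in enumerate(parses):
        if tree_leaves(parse) == tokens:
            return index
    return None


# Toy data reproducing the reported symptom: the parses are in reverse order.
sentences = [["I", "am", "here"], ["Go"], ["It", "was", "ushered"]]
parses = [
    "(ROOT (S (NP (PRP It)) (VP (VBD was) (VP (VBN ushered)))))",
    "(ROOT (S (VP (VB Go))))",
    "(ROOT (S (NP (PRP I)) (VP (VBP am) (ADVP (RB here)))))",
]
mapping = [matching_parse_index(tokens, parses) for tokens in sentences]
print(mapping)  # [2, 1, 0]
```

With a real doc, the inputs would presumably be `[token.text for token in sentence.tokens]` and `str(sentence.constituency)`, assuming `str()` yields the same bracketed form that `print` shows above. Bracket tokens may be escaped in the tree (e.g. `-LRB-`), so an exact comparison could report false mismatches on such sentences.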
Environment (please complete the following information):
OS: macOS
Python version: 3.9.7
Stanza version: 1.3.0
Additional context
The problem persists even if I use a Stanza tokenization model, but I added tokenize_pretokenized and tokenize_no_ssplit because I want to use the coref labels from LitBank.
I cannot reproduce this with shorter sentences.
This also happens with the texts below, so it is not specific to Charles Dickens or to LitBank preprocessing.
Phase the First : The Maiden I On an evening in the latter part of May a middle-aged man was walking homeward from Shaston to the village of Marlott , in the adjoining Vale of Blakemore , or Blackmoor .\n\nThe pair of legs that carried him were rickety , and there was a bias in his gait which inclined him somewhat to the left of a straight line .\n\nHe occasionally gave a smart nod , as if in confirmation of some opinion , though he was not thinking of anything in particular .
I will now proceed to write a very long sentence, one that contains many asides, with the hope of triggering whatever it is that caused the errors in the sentences I tried earlier, both of which came from novels in the LitBank dataset.\n\nAlas, I am not able to match the prolixity of Charles Dickens; nor, in fact, did I ever hope or intend to -- it is my desire merely to determine the exact conditions that cause the issues I have encountered, with the hope that these conditions are not somehow intrinsically linked to a problem with CoNLL formatting. It is very difficult to write long example sentences about nothing in particular; I have much to say about matters of importance, but none are relevant in this context.
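Since the problem only seems to show up with long sentences, it may be useful to log the token count of each sentence alongside the parses. The sketch below assumes sentences are separated by blank lines, as in the texts above, where the separator is shown as a literal `\n\n`. `to_pretokenized` is a hypothetical helper, not a Stanza function, and the sample string is a shortened toy input.

```python
def to_pretokenized(text):
    """Split text into sentences on blank lines and into tokens on whitespace."""
    # The sample texts in this report show the separator as a literal "\n\n"
    # (backslash + n), so normalise that to real newlines first.
    text = text.replace("\\n", "\n")
    return [block.split() for block in text.split("\n\n") if block.strip()]


# Shortened toy input; the real texts are much longer.
sample = "He occasionally gave a smart nod .\\n\\nThe pair of legs were rickety ."
sentences = to_pretokenized(sample)
for index, tokens in enumerate(sentences):
    # Print the sentence index and its token count
    print(index, len(tokens))
```

As far as I know, a pipeline created with `tokenize_pretokenized=True` also accepts a list of token lists like this one directly, which would rule out any ambiguity about how the string is split.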