In certain cases, the line linked below throws an error and crashes the coreference processor. Since the exception is unhandled, the method returns no document object. It took me a while to find the root of this issue, and I don't know all of the inner workings of stanza, so I'm not sure I can write a robust fix that doesn't cause problems somewhere else.
I discovered the issue while doing a naïve character-based split of a long section of text (130,000 characters), breaking it into chunks of 2k to 10k characters. I understand that blocks of text split within sentences, and sometimes even in the middle of a word, are not specifically a use case that coreference resolution should be able to handle, but being new to stanza, it was not clear to me that this was what caused the issue.
This is the for loop that crashes (specifically at end_word = word_pos[span[1]]):
    for span_idx, span in enumerate(span_cluster):
        sent_id = sent_ids[span[0]]
        sentence = sentences[sent_id]
        start_word = word_pos[span[0]]
        end_word = word_pos[span[1]]
        # very UD specific test for most number of proper nouns in a mention
        # will do nothing if POS is not active (they will all be None)
        num_propn = sum(word.pos == 'PROPN' for word in sentence.words[start_word:end_word])
        if ((span[1] - span[0] > max_len) or
            span[1] - span[0] == max_len and num_propn > max_propn):
            max_len = span[1] - span[0]
            best_span = span_idx
            max_propn = num_propn
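My guess at the mechanism, sketched with toy data rather than stanza's real structures: if span[1] is an exclusive end index (which the length computation span[1] - span[0] and the slice suggest, but which I have not confirmed), then a span ending on the last word of the document has span[1] equal to the length of word_pos, and the lookup goes one past the end. The names end_word_unguarded and end_word_guarded are hypothetical, and the guarded version is only one possible way to avoid the lookup, not a tested patch.

```python
# Toy stand-in for the per-word sentence positions; NOT stanza's real data.
word_pos = [0, 1, 2, 3]  # one entry per word in a four-word document


def end_word_unguarded(word_pos, span):
    """Mirror the lookup in the loop above: index by the span's end."""
    return word_pos[span[1]]


def end_word_guarded(word_pos, span):
    """Look up the last word inside the span, then add one.

    Assumes span[1] is exclusive, so span[1] - 1 is always a valid index.
    The result is still usable as an exclusive slice end.
    """
    return word_pos[span[1] - 1] + 1


mid_span = (0, 2)   # ends before the last word
last_span = (2, 4)  # ends on the last word, so span[1] == len(word_pos)

print(end_word_unguarded(word_pos, mid_span), end_word_guarded(word_pos, mid_span))

try:
    end_word_unguarded(word_pos, last_span)
except IndexError as err:
    print("unguarded lookup failed:", err)

print(end_word_guarded(word_pos, last_span))
```

If this reading is right, it would also explain why trailing punctuation helps: the extra token means no mention can end on the final word of the text.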
Conditions to reproduce the issue:
Provide a block of text that ends with a word that is part of a coreference span
The text has no punctuation at the end
Workaround:
Add any punctuation to the end of the text if the error is thrown
I tested period, comma, exclamation mark, question mark, and colon, and those all worked
Adding a newline (\n) or an extra space did not help as workaround attempts.
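For anyone hitting the same thing, the workaround can be applied up front instead of after the error. This is a minimal sketch: ensure_terminal_punctuation and TESTED_TERMINATORS are names I made up, and the set of marks is only the ones I tested above, so other punctuation is treated as missing and gets a period appended.

```python
# Marks that worked at the end of the text in my tests (see the list above).
TESTED_TERMINATORS = (".", ",", "!", "?", ":")


def ensure_terminal_punctuation(text: str, terminator: str = ".") -> str:
    """Append a terminator when the text does not end in a tested mark.

    Trailing whitespace is removed first, since a trailing space or newline
    did not prevent the crash.
    """
    stripped = text.rstrip()
    if not stripped:
        # Empty or whitespace-only input: nothing to fix.
        return text
    if stripped.endswith(TESTED_TERMINATORS):
        return stripped
    return stripped + terminator


chunk = "Sometimes people are part of the problem, and sometimes they are the solution to it"
print(ensure_terminal_punctuation(chunk))
```

Each chunk would go through this function before being passed to the pipeline. Note that it changes the text slightly, so character offsets at the very end of a chunk shift by one.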
Use case where this is relevant:
I am processing massive amounts of text that was collected using OCR, so there are sometimes cases where punctuation gets missed or misread by the OCR.
Example text that causes issue: "Sometimes people are part of the problem, and sometimes they are the solution to it"
Update to text that resolves the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it."
https://github.com/stanfordnlp/stanza/blob/6e442a6199f7e466c57c02de8d2f9d516bdd5715/stanza/pipeline/coref_processor.py#L127