stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Exception thrown by coref processor #1397

Open tseanard opened 2 weeks ago

tseanard commented 2 weeks ago

https://github.com/stanfordnlp/stanza/blob/6e442a6199f7e466c57c02de8d2f9d516bdd5715/stanza/pipeline/coref_processor.py#L127

In certain cases, the line linked above throws an error and crashes the coreference processor. Since the exception is unhandled, no document object is returned by the method. It took me a while to find the root cause of this issue, and since I'm not sure of all of stanza's inner workings, I don't know that I can create a robust fix that wouldn't create issues somewhere else.

I discovered the issue while doing a naïve character-based split across a long section of text (130,000 characters), breaking it into chunks of 2k to 10k characters. I understand that passing blocks of text that are split mid-sentence, and sometimes even mid-word, is not a use case coreference resolution should necessarily be expected to handle, but being new to stanza it was not clear to me that this was what was causing the issue.
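For context, the splitting I was doing looked roughly like this (a minimal sketch; the helper name and chunk size are illustrative, not my exact code):

```python
def naive_char_chunks(text, chunk_size):
    """Split text into fixed-size character chunks, ignoring
    sentence and word boundaries entirely."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Sometimes people are part of the problem, and sometimes they are the solution to it."
# Chunks produced this way can end mid-word or mid-sentence,
# e.g. here the first chunk is cut off without sentence-final punctuation.
chunks = naive_char_chunks(text, 40)
```

Feeding chunks like these to the pipeline is what eventually triggered the crash.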

This is the for loop that crashes (specifically the line end_word = word_pos[span[1]]):

            for span_idx, span in enumerate(span_cluster):
                sent_id = sent_ids[span[0]]
                sentence = sentences[sent_id]
                start_word = word_pos[span[0]]
                end_word = word_pos[span[1]]
                # very UD specific test for most number of proper nouns in a mention
                # will do nothing if POS is not active (they will all be None)
                num_propn = sum(word.pos == 'PROPN' for word in sentence.words[start_word:end_word])

                if ((span[1] - span[0] > max_len) or
                    span[1] - span[0] == max_len and num_propn > max_propn):
                    max_len = span[1] - span[0]
                    best_span = span_idx
                    max_propn = num_propn
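To illustrate the failure mode in isolation (a minimal sketch with stand-in data; word_pos here is a plain list with made-up values, not stanza's actual internal state): if the model predicts a span whose end index falls past the last entry of word_pos, the unguarded lookup raises an IndexError.

```python
# Stand-ins for the processor's internals; values are illustrative.
word_pos = [0, 1, 2, 3]   # one entry per word in the processed chunk
span = (1, 5)             # a predicted span whose end index is out of range

try:
    end_word = word_pos[span[1]]   # mirrors the crashing line
except IndexError:
    end_word = None                # the real processor has no such guard

# One defensive option would be clamping the index, though whether that
# is semantically correct for stanza's coref clusters I can't say:
end_word_clamped = word_pos[min(span[1], len(word_pos) - 1)]
```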

Condition to reproduce issue:

Workaround:

Use case where this is relevant:

Example text that causes the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it"

Updated text that resolves the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it."
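A workaround consistent with the above is to chunk on sentence boundaries rather than raw character counts, so that no chunk ends mid-sentence or mid-word (a rough sketch; the naïve regex sentence splitter and helper name are mine, not part of stanza):

```python
import re

def chunk_on_sentence_ends(text, target_size):
    """Greedily pack whole sentences into chunks of roughly target_size
    characters. A single sentence longer than target_size becomes its
    own oversized chunk rather than being split mid-sentence."""
    # Naive split after ., !, or ? followed by whitespace; illustrative only.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > target_size:
            chunks.append(current)
            current = sent
        else:
            current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```

Every chunk this produces ends with sentence-final punctuation (assuming the input does), which avoided the crash for the text above.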

AngledLuffa commented 1 week ago

Can reproduce. Thank you for calling this to our attention.