nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

Different segmentation with Spacy and when using pySBD directly #55

Closed nmstoker closed 4 years ago

nmstoker commented 4 years ago

Firstly, thank you for this project - I was lucky to find it and it is really useful.

I seem to have found a case where segmentation behaves differently when run within the spaCy pipeline versus when using pySBD directly. I stumbled on it in my own text, where a sentence following a quoted sentence was being lumped together with it. I looked through the Golden Rules and found this wasn't expected, and then noticed that even the text from one of your own tests behaves differently in spaCy.

To reproduce run these two bits of code:

import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))
doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")
for sent in doc.sents:
    print(str(sent).strip() + '\n')

She turned to him, "This is great." She held the book out to show him.

import pysbd
text = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False)
#print(seg.segment(text))
for sent in seg.segment(text):
    print(str(sent).strip() + '\n')

She turned to him, "This is great."

She held the book out to show him.

The second way is the desired output (based on the rules at least)

jenojp commented 4 years ago

Hey @nmstoker, I'm having similar issues, and I think I've discovered what is going on, though I haven't had a chance to find the root cause in pysbd.

So if you run your example directly:

import pysbd
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."

seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(fake_note))

Returns

[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
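Note what those offsets actually cover: the second span starts at index 35, which in the original string is the space *before* "She", not the "S" itself. A quick check in plain Python (no spaCy or pysbd needed):

```python
text = "She turned to him, \"This is great.\" She held the book out to show him."

# pysbd reports the second sentence span starting at offset 35, but that
# character is the inter-sentence space; the word "She" begins at 36.
print(repr(text[35]))  # ' '
print(repr(text[36]))  # 'S'
```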

If we replicate the spacy pipeline code found in pysbd/utils.py with a few print statements to see what's going on under the hood:

def test(doc):
    sents_char_spans = seg.segment(doc.text)
    print(sents_char_spans)
    char_spans = [doc.char_span(sent_span.start, sent_span.end)
                  for sent_span in sents_char_spans]
    print(char_spans)
    start_token_ids = [span[0].idx for span in char_spans
                       if span is not None]
    for token in doc:
        token.is_sent_start = (True if token.idx in start_token_ids
                               else False)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])

We can see that sents_char_spans exactly matches the direct run. The problem appears when building char_spans: the start/end character indices do not line up with the spaCy Doc object's token boundaries, so doc.char_span returns None for those spans.

#sents_char_spans
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]

#char_spans
[She turned to him, "This is great.", None]

So if you run the following you only get one sentence:

nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
['She turned to him, "This is great." She held the book out to show him.']

If you look at the character indices of the tokens without using pysbd, they get messed up once they hit your \"

nlp = spacy.blank("en")
doc = nlp(fake_note)
print([(token.text, token.idx) for token in doc])
[('She', 0), ('turned', 4), ('to', 11), ('him', 14), (',', 17), ('"', 19), ('This', 20), ('is', 25), ('great', 28), ('.', 33), ('"', 34), ('She', 36), ('held', 40), ('the', 45), ('book', 49), ('out', 54), ('to', 58), ('show', 61), ('him', 66), ('.', 69)]
#notice 34 instead of 35
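This is the mismatch: spaCy's doc.char_span(start, end) returns None unless start and end fall exactly on token boundaries, and pysbd's second span starts on the inter-sentence space (35) rather than on "She" (36). One possible workaround - adjust_span_start below is my own illustrative helper, not part of either library - is to shift each span start past leading whitespace before calling char_span:

```python
def adjust_span_start(text, start):
    """Move a span start forward past any leading whitespace so it
    lands on a real character (and hence, usually, a token boundary)."""
    while start < len(text) and text[start].isspace():
        start += 1
    return start

text = "She turned to him, \"This is great.\" She held the book out to show him."
print(adjust_span_start(text, 35))  # 36 - where "She" actually begins
print(adjust_span_start(text, 0))   # 0 - already on a token
```

Newer spaCy versions also accept an alignment_mode argument on Doc.char_span (e.g. alignment_mode="expand") that snaps misaligned character offsets to token boundaries, which may be another way out if your spaCy version supports it.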

I'm seeing similar issues when you have a series of special characters, such as this example:

fake_note = """
PHYSICAL EXAMINATION:  Vital signs:  Temperature 96.5??????, blood
pressure 158/49, pulse 76, respirations 14, oxygen saturation
98% on 2 L, 92% on room air.  General:  She was elderly,
lying in bed."""

If you remove the ? series, the problem goes away.
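A quick way to detect this kind of offset drift without eyeballing the output is to check each returned span against a slice of the original text. spans_consistent below is a hypothetical helper, not part of pysbd, sketched over (sent, start, end) triples:

```python
def spans_consistent(text, spans):
    """Check that each (sent, start, end) triple actually slices out the
    sentence text it claims to cover (ignoring surrounding whitespace)."""
    return all(text[start:end].strip() == sent.strip()
               for sent, start, end in spans)

text = "She turned to him, \"This is great.\" She held the book out to show him."
good = [('She turned to him, "This is great."', 0, 35),
        ("She held the book out to show him.", 35, 70)]
bad = [('She turned to him, "This is great."', 0, 34)]  # off by one

print(spans_consistent(text, good))  # True
print(spans_consistent(text, bad))   # False
```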

nmstoker commented 4 years ago

That's interesting. Great work digging into it, @jenojp - you got further than I did!

nipunsadvilkar commented 4 years ago

@nmstoker Thanks for the kind words and for noticing the issue. The issue is known to me, and as @jenojp illustrated with an example, he's right: matching pysbd's character offset indices against spaCy's Doc object is a bit tricky, which is why we see that disparity in the output. doc.sents requires the token attribute, i.e., doc[0].is_sent_start, to be set to True. The logic is written such that if we get a proper span everything is straightforward and we get neat results; on the other hand, if char_span returns None, we lose that sentence.
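The logic described above can be sketched without spaCy: sentence boundaries come from the token offsets that appear in the resolved spans, so a None span simply drops a boundary and its sentence merges into the previous one. The names below are illustrative, not pysbd internals:

```python
# Token start offsets for the example text (from the spaCy tokenization
# shown earlier in the thread).
token_starts = [0, 4, 11, 14, 17, 19, 20, 25, 28, 33, 34,
                36, 40, 45, 49, 54, 58, 61, 66, 69]

def count_sentences(resolved_span_starts, token_starts):
    """Each resolved span contributes one is_sent_start=True token;
    the number of sentences equals the number of such tokens."""
    return sum(1 for idx in token_starts if idx in resolved_span_starts)

print(count_sentences([0, 36], token_starts))  # 2: both spans resolved
print(count_sentences([0], token_starts))      # 1: second span was None
```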

I have been wanting to resolve this issue but haven't found much time; I will see if I can do something about it in the near future. Though it would be great if anyone could come up with a solution - that would be a very welcome contribution. Thanks again for pointing it out.

jenojp commented 4 years ago

@nipunsadvilkar I'll keep you posted if I can get some free time to look into it more. This is a really promising project!

nipunsadvilkar commented 4 years ago

@jenojp Have a look at the new issue I just created. The solution there might give proper segmentation both with and without spaCy.

nipunsadvilkar commented 4 years ago

Fixed #63