Closed nmstoker closed 4 years ago
Hey @nmstoker , I am having similar issues and I think I discovered what is going on, though I haven't gotten a chance to find the root cause in pysbd.
So if you run your example directly:
import pysbd
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(fake_note))
Returns
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
If we replicate the spaCy pipeline code found in pysbd/utils.py with a few print statements to see what's going on under the hood:
def test(doc):
    sents_char_spans = seg.segment(doc.text)
    print(sents_char_spans)
    char_spans = [doc.char_span(sent_span.start, sent_span.end)
                  for sent_span in sents_char_spans]
    print(char_spans)
    start_token_ids = [span[0].idx for span in char_spans
                       if span is not None]
    for token in doc:
        token.is_sent_start = token.idx in start_token_ids
    return doc
nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
We can see that sents_char_spans exactly matches the direct run. The problem appears when building char_spans: the start/end character indices do not align with token boundaries in the spaCy Doc object, so doc.char_span returns None for those spans.
#sents_char_spans
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
#char_spans
[She turned to him, "This is great.", None]
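The mismatch is easy to see with plain string indexing (a minimal sketch using the example text; no spaCy or pysbd required): pysbd's second span starts at character 35, but that character is the inter-sentence space, so it can never be the start of a token.

```python
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."

# pysbd reports the second sentence as TextSpan(start=35, end=69), but
# character 35 is the space between the sentences, not the start of a token.
print(repr(fake_note[35]))   # ' '

# The second sentence's first token ("She") actually begins at index 36.
print(fake_note[36:70])      # She held the book out to show him.
```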
So if you run the following, you only get one sentence:
nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
['She turned to him, "This is great." She held the book out to show him.']
If you look at the character indices of the tokens without using pysbd, they stop lining up with pysbd's offsets once they hit the escaped quote (\"):
nlp = spacy.blank("en")
doc = nlp(fake_note)
print([(token.text, token.idx) for token in doc])
[('She', 0), ('turned', 4), ('to', 11), ('him', 14), (',', 17), ('"', 19), ('This', 20), ('is', 25), ('great', 28), ('.', 33), ('"', 34), ('She', 36), ('held', 40), ('the', 45), ('book', 49), ('out', 54), ('to', 58), ('show', 61), ('him', 66), ('.', 69)]
#notice: the closing quote token sits at index 34 and the next sentence's first token ("She") at 36, while pysbd's second span starts at 35 (the space)
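That misalignment is exactly why doc.char_span returns None: by default it only accepts offsets that fall precisely on token boundaries. A quick sketch with a blank English tokenizer (same assumption as the examples above):

```python
import spacy

nlp = spacy.blank("en")
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."
doc = nlp(fake_note)

# pysbd's second span is (35, 69); start 35 is the inter-sentence space,
# which is not a token boundary, so char_span returns None.
print(doc.char_span(35, 69))

# Shifting to the actual token boundaries (36, 70) yields a valid span.
print(doc.char_span(36, 70))
```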
I'm seeing similar issues when there is a series of special characters, such as in this example:
fake_note = """
PHYSICAL EXAMINATION: Vital signs: Temperature 96.5??????, blood
pressure 158/49, pulse 76, respirations 14, oxygen saturation
98% on 2 L, 92% on room air. General: She was elderly,
lying in bed."""
If you remove the ? series, the problem goes away.
That's interesting. Great work digging into it, @jenojp, you got further than I did!
@nmstoker Thanks for the appreciation, and @jenojp, thanks for digging into the issue. As he illustrated with an example, he's right: matching pysbd's character offset indices with spaCy's Doc object is a bit tricky, which is why we see that disparity in the output. doc.sents requires each token's is_sent_start attribute (e.g. doc[0].is_sent_start) to be set to True at sentence boundaries. The logic is written such that if we get a proper span, everything is straightforward and we get neat results; on the other hand, if char_span returns None, we lose that sentence entirely.
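To illustrate the point about doc.sents being driven entirely by is_sent_start, here is a minimal sketch with a blank pipeline (a hypothetical toy text, not pysbd's output):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("First sentence. Second sentence.")

# doc.sents is derived from the per-token is_sent_start flags;
# a token flagged True opens a new sentence.
for token in doc:
    token.is_sent_start = token.i in (0, 3)  # token 3 is "Second"

print([s.text for s in doc.sents])
```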
I have been wanting to resolve this issue but haven't found much time; I will see if I can do something about it in the near future. Though, it would be great if anyone could come up with a solution. That would be a very welcome contribution. Thanks again for pointing it out.
@nipunsadvilkar I'll keep you posted if I can get some free time to look into it more. This is a really promising project!
@jenojp Have a look at a new issue which I just created. The solution might work to get proper segmentation in both with or without using spaCy.
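One possible workaround sketch, based on my reading of the examples above (the misaligned offsets start on inter-sentence whitespace); this is an assumption on my part, not necessarily the fix adopted in the linked issue: trim whitespace from both ends of each pysbd span before calling Doc.char_span so it can align to token boundaries.

```python
import spacy

def aligned_char_span(doc, start, end):
    """Trim leading/trailing whitespace from a character span so that
    Doc.char_span can align it to token boundaries (workaround sketch)."""
    text = doc.text
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return doc.char_span(start, end)

nlp = spacy.blank("en")
doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")

# The raw pysbd offsets (35, 69) fail, but the trimmed span succeeds.
print(doc.char_span(35, 69))
print(aligned_char_span(doc, 35, 69))
```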
Fixed #63
Firstly, thank you for this project. I was lucky to find it and it is really useful.
I seem to have found a case where the segmentation behaves differently when run within the spaCy pipeline than when pySBD is run directly. I stumbled on it with my own text, where a sentence following a quoted sentence was being lumped together with it. I looked through the Golden Rules and found this wasn't expected, and then noticed that even with the text from one of your tests, it acts differently in spaCy.
To reproduce run these two bits of code:
The second way gives the desired output (based on the rules, at least).