nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
MIT License
813 stars, 84 forks

Does not properly segment within quotations #118

Open Hgherzog opened 1 year ago

Hgherzog commented 1 year ago

When dealing with a long statement of facts quoted from legal text, the text is not split into sentences inside left and right double quotation marks (“ and ”). These are different from the straight " character. I cannot share the text here as it deals with sensitive content.

import pysbd

seg = pysbd.Segmenter(language='en')
sentences = seg.segment(above_text)

Returns a list of length 1 and does not split the text into sentences. The expected behavior is to split the text into sentences within the quotation marks.

libTorrentUser commented 1 year ago

I confirm the... bug? Not sure if it is a bug or intentional, but:

import pysbd

text = 'He said "hello. And then world."'
seg = pysbd.segmenter.Segmenter(language='en', clean=True)
print(seg.segment(text))
['He said "hello. And then world."']

I was expecting

[
    'He said', 
    '"hello.',
    'And then world."'
]

It almost does the right thing when using single quotes. Almost. The sentence is split correctly, but the terminating single quote is treated as a sentence of its own:

import pysbd

text = "He said 'hello. And then world.'"
seg = pysbd.segmenter.Segmenter(language='en', clean=True)
print(seg.segment(text))
[
    "He said 'hello.",
    'And then world.',
    "'"
]
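
A possible workaround for the dangling quote in the output above is to post-process the segments and attach any segment made up only of closing quote characters to the sentence before it. This is only a sketch: `merge_dangling_quotes` and `CLOSING_QUOTES` are hypothetical names, not part of pySBD, and the input list is copied from the output shown above rather than produced by calling the library. It does nothing for the double-quote case, where no split happens at all.

```python
# Hypothetical post-processing helper; NOT part of pySBD.
# Assumption: a segment consisting only of closing quote marks
# belongs to the sentence that precedes it.
CLOSING_QUOTES = {"'", '"', "\u2019", "\u201d"}


def merge_dangling_quotes(segments):
    """Attach quote-only segments to the previous segment."""
    merged = []
    for segment in segments:
        stripped = segment.strip()
        only_quotes = bool(stripped) and all(ch in CLOSING_QUOTES for ch in stripped)
        if only_quotes and merged:
            # Drop trailing whitespace before gluing the quote back on.
            merged[-1] = merged[-1].rstrip() + stripped
        else:
            merged.append(segment)
    return merged


# Segments copied from the single-quote output reported above.
segments = ["He said 'hello.", "And then world.", "'"]
result = merge_dangling_quotes(segments)
print(result)
```

A quote-only segment at the very start of the list is left alone, since there is nothing before it to attach to.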