Open Hgherzog opened 1 year ago
I can confirm the... bug? I'm not sure whether it is a bug or intentional, but:
import pysbd
text = 'He said "hello. And then world."'
seg = pysbd.segmenter.Segmenter(language='en', clean=True)
print(seg.segment(text))
['He said "hello. And then world."']
I was expecting
[
'He said',
'"hello.',
'And then world."'
]
It almost does the right thing when using single quotes. Almost. The sentence is split correctly, but the terminating single quote is treated as a separate sentence:
import pysbd
text = "He said 'hello. And then world.'"
seg = pysbd.segmenter.Segmenter(language='en', clean=True)
print(seg.segment(text))
[
"He said 'hello.",
'And then world.',
"'"]
When dealing with a long statement of facts quoted from legal text, the text is not split into sentences inside left and right double quotation marks (“ and ”). These are different from the straight " character. I cannot share the text here because it contains sensitive content.
import pysbd
seg = pysbd.Segmenter(language='en')
sentences = seg.segment(above_text)
This returns a list of length 1 and does not split the text into sentences. The expected behavior is to split the text into sentences inside the quotation marks.
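Since the original text cannot be shared, it may help to first confirm which quote characters an input actually contains before building a shareable reproduction. The sketch below uses only the standard library; `describe_quotes` is a hypothetical helper, and `sample` is a made-up stand-in, not the reporter's text.

```python
import unicodedata


def describe_quotes(text):
    """Map each quote-like character found in text to its Unicode name."""
    found = {}
    for ch in text:
        # Straight quotes, plus Unicode initial (Pi) and final (Pf) quote punctuation.
        if ch in "\"'" or unicodedata.category(ch) in ("Pi", "Pf"):
            found[ch] = unicodedata.name(ch)
    return found


# Made-up stand-in that uses curly double quotes instead of the straight " character.
sample = "He said \u201chello. And then world.\u201d"
print(describe_quotes(sample))
```

If this reports LEFT DOUBLE QUOTATION MARK and RIGHT DOUBLE QUOTATION MARK rather than QUOTATION MARK, the input uses the curly variants described above, and a short synthetic string like `sample` could be passed to `seg.segment` in place of the sensitive text.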