mediacloud / sentence-splitter

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

Enable whitespace-preserving splitting #8

Open lfurrer opened 2 years ago

lfurrer commented 2 years ago

This PR introduces a new flag, strip_whitespace, for the existing split() method, and a new method, boundaries().

The new features are best described by example:

>>> from sentence_splitter import SentenceSplitter
>>> sbd = SentenceSplitter('en')
>>>
>>> # Inherited behaviour (unchanged):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n")
['A brief note.', 'Another one.', 'And a final one.']
>>>
>>> # New flag strip_whitespace (default: True):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n", strip_whitespace=False)
['A brief note.   ', 'Another   one.\t', 'And a final one.\n']
>>>
>>> # New method boundaries():
>>> sbd.boundaries("A brief note.   Another   one.\tAnd a final one.\n")
[16, 31]
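To make the relationship between the two new features explicit, here is a minimal sketch of how the offsets from boundaries() line up with the whitespace-preserving segments. The helper slice_at_boundaries is hypothetical and not part of this PR; it only assumes that each boundary is the character offset at which the next sentence starts, which is what the example output above suggests.

```python
from typing import List


def slice_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Cut `text` at the given offsets (hypothetical helper, not in the PR).

    Assumes every offset marks the first character of the next sentence,
    so trailing whitespace stays attached to the preceding sentence.
    """
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


text = "A brief note.   Another   one.\tAnd a final one.\n"
segments = slice_at_boundaries(text, [16, 31])
print(segments)
# ['A brief note.   ', 'Another   one.\t', 'And a final one.\n']
```

Under that assumption, the segments are identical to the strip_whitespace=False output shown above, and joining them gives back the input string unchanged.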

The changes in prose:

The reason for the proposed changes is simple: the current implementation mixes two tasks, sentence boundary detection and whitespace normalisation, in an inseparable way. This makes the splitter unusable in contexts where the original whitespace needs to be retained. This PR adds an option to perform non-destructive sentence splitting.

I tried to stick to the original code as much as possible so as not to change the behaviour of the splitter; in particular, the regular expressions are unchanged. The new and old implementations yielded identical results on a small corpus of 35k docs/60k sentences in EN/DE/ES/FR/IT/PL. However, this doesn't guarantee that the new implementation behaves exactly the same in all edge cases. A test suite covering typical and interesting cases in some (or all) of the languages supported by this sentence splitter could help to gain some confidence here.
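As a starting point for such a test suite, here is a rough sketch of two invariants that could be checked for every test document. The function names are hypothetical, and the normalisation rule (collapse whitespace runs to one space, then trim) is an assumption inferred from the example above rather than taken from the splitter's code.

```python
from typing import List


def normalise(segment: str) -> str:
    # Assumption: whitespace normalisation collapses runs and trims the ends.
    return " ".join(segment.split())


def check_invariants(text: str, stripped: List[str], preserved: List[str]) -> List[str]:
    """Return a list of problems; an empty list means both invariants hold.

    `stripped` is the output of split(text), and `preserved` is the output
    of split(text, strip_whitespace=False).
    """
    problems = []
    # Invariant 1: non-destructive splitting must reassemble to the input.
    if "".join(preserved) != text:
        problems.append("preserved segments do not reassemble the input")
    # Invariant 2: normalising the preserved segments gives the old output.
    if [normalise(s) for s in preserved] != list(stripped):
        problems.append("normalised segments differ from stripped output")
    return problems


# Values copied from the example in this PR description.
text = "A brief note.   Another   one.\tAnd a final one.\n"
stripped = ["A brief note.", "Another one.", "And a final one."]
preserved = ["A brief note.   ", "Another   one.\t", "And a final one.\n"]
print(check_invariants(text, stripped, preserved))  # []
```

The check takes plain lists rather than calling the splitter itself, so the same function could be run against recorded outputs of both the old and the new implementation.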