Enable whitespace-preserving splitting

This PR introduces a new flag, strip_whitespace, for the existing split() method, and a new method, boundaries().

The new features are best described by example:

>>> from sentence_splitter import SentenceSplitter
>>> sbd = SentenceSplitter('en')
>>>
>>> # Inherited behaviour (unchanged):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n")
['A brief note.', 'Another one.', 'And a final one.']
>>>
>>> # New flag strip_whitespace (default: True):
>>> sbd.split("A brief note.   Another   one.\tAnd a final one.\n", strip_whitespace=False)
['A brief note.   ', 'Another   one.\t', 'And a final one.\n']
>>>
>>> # New method boundaries():
>>> sbd.boundaries("A brief note.   Another   one.\tAnd a final one.\n")
[16, 31]

The changes in prose:

If the new flag strip_whitespace is True (the default), leading, trailing and duplicated whitespace is stripped, which is the current behaviour. If the flag is False, all whitespace is preserved, such that "".join(sbd.split(text, strip_whitespace=False)) == text.
The new method SentenceSplitter.boundaries(text: str) -> List[int] returns a list of character offsets into the original string, denoting sentence boundaries. The number of boundaries is always one less than the number of sentences, except for an empty string, in which case both methods return an empty list.
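
The relationship between the two outputs can be sketched as follows. This is a minimal illustration, not code from the PR: segments_from_boundaries is a hypothetical helper, and it assumes that each offset marks the start of the next sentence, as the example output above suggests.

```python
from typing import List


def segments_from_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Cut text at the given character offsets (hypothetical helper)."""
    if not text:
        # Mirrors the empty-string case: no sentences and no boundaries.
        return []
    # Add the start and end of the string so every segment has two cut points.
    cuts = [0, *boundaries, len(text)]
    return [text[start:end] for start, end in zip(cuts, cuts[1:])]


text = "A brief note.   Another   one.\tAnd a final one.\n"
# Offsets copied from the boundaries() example above, not computed here.
segments = segments_from_boundaries(text, [16, 31])
print(segments)
```

With the offsets from the example, slicing gives the same three strings as split(..., strip_whitespace=False), and joining them reproduces the input.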

The reason for the proposed changes is simple: the current implementation mixes two tasks, sentence boundary detection and whitespace normalisation, in an inseparable way. This makes the sentence splitter unusable in contexts where the original whitespace needs to be retained. This PR adds an option to perform non-destructive sentence splitting.

I tried to stick to the original code as much as possible in order not to change the behaviour of the splitter; in particular, the regular expressions have not been changed. The new and old implementations yielded identical results on a small corpus of 35k docs/60k sentences in EN/DE/ES/FR/IT/PL. However, this doesn't guarantee that the new implementation behaves exactly the same in all edge cases. A test suite covering typical and interesting cases in some (or all) of the languages supported by this sentence splitter would help build confidence here.
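
As a possible starting point for such a suite, the two properties stated above (lossless joining, and one boundary fewer than sentences) can be written as language-independent checks. The sketch below is an assumption about how such tests could look, not part of this PR: check_invariants, toy_split and toy_boundaries are hypothetical names, and the toy splitting rule exists only so that the example runs without the library.

```python
import re
from typing import Callable, List

# Toy rule (NOT the library's logic): a boundary follows ., ! or ? plus whitespace.
_TOY_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=\S)")


def toy_boundaries(text: str) -> List[int]:
    """Stand-in for boundaries(): offsets where the next sentence starts."""
    return [match.end() for match in _TOY_BOUNDARY.finditer(text)]


def toy_split(text: str) -> List[str]:
    """Stand-in for split(text, strip_whitespace=False)."""
    if not text:
        return []
    cuts = [0, *toy_boundaries(text), len(text)]
    return [text[start:end] for start, end in zip(cuts, cuts[1:])]


def check_invariants(
    text: str,
    split: Callable[[str], List[str]],
    boundaries: Callable[[str], List[int]],
) -> None:
    """Assert the properties described for whitespace-preserving splitting."""
    sentences = split(text)
    offsets = boundaries(text)
    # Non-destructive splitting: joining the pieces restores the input.
    assert "".join(sentences) == text
    if text:
        # One boundary fewer than sentences.
        assert len(offsets) == len(sentences) - 1
    else:
        # Empty input: both results are empty lists.
        assert sentences == [] and offsets == []


cases = [
    "",
    "A brief note.   Another   one.\tAnd a final one.\n",
    "No terminator at all",
    "  Leading space. Trailing space.  ",
]
for case in cases:
    check_invariants(case, toy_split, toy_boundaries)
```

To exercise the real splitter, the two stand-ins would be replaced by lambda t: sbd.split(t, strip_whitespace=False) and sbd.boundaries, with the cases drawn from each supported language.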

mediacloud / sentence-splitter

Enable whitespace-preserving splitting #8