Closed sven-h closed 2 years ago
Thanks. SimpleSentenceSplitter leverages regex, which runs slowly on very long strings. The contract of this API assumes that the input is normal English text, and your use case doesn't fit that. I suggest that you add a safeguard check before calling this API. We could do such a check internally, but it would cause unnecessary overhead for most other users.
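A minimal sketch of what such a caller-side safeguard could look like. Everything here is an assumption for illustration: the class name SplitGuard, the 1000-character threshold, and the choice of '.', '!' and '?' as boundary characters are all hypothetical and not part of the library. The splitter is passed in as a function so the sketch stays self-contained. Whether this check actually avoids the slow path depends on the regex internals, which this thread does not show.

```java
import java.util.function.Function;

// Hypothetical helper, not part of the library discussed in this thread.
final class SplitGuard {
    // Assumed threshold: longest stretch of text we accept without any
    // sentence-ending punctuation. Tune this for your own data.
    static final int MAX_CHARS_WITHOUT_BOUNDARY = 1000;

    private SplitGuard() {}

    // Length of the longest run of characters that contains no '.', '!' or '?'.
    static int longestRunWithoutBoundary(String text) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                current = 0;
            } else {
                current++;
                if (current > longest) {
                    longest = current;
                }
            }
        }
        return longest;
    }

    // Cheap linear-time heuristic: does the text have boundaries often enough
    // to look like ordinary prose?
    static boolean looksLikeProse(String text) {
        return longestRunWithoutBoundary(text) <= MAX_CHARS_WITHOUT_BOUNDARY;
    }

    // Calls the splitter only when the text passes the heuristic; otherwise
    // returns the whole text as a single "sentence" without invoking it.
    static String[] splitSafely(String text, Function<String, String[]> splitter) {
        if (text == null || text.isEmpty()) {
            return new String[0];
        }
        if (!looksLikeProse(text)) {
            return new String[] { text };
        }
        return splitter.apply(text);
    }
}
```

Assuming the splitter exposes a split(String) method that returns String[], a call could look like SplitGuard.splitSafely(text, splitter::split). Returning the unsplit text is only one possible fallback; chunking the input or skipping it may suit other pipelines better.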
Okay, just wanted to let you know.
SimpleSentenceSplitter has poor runtime performance when processing messy texts that do not contain any sentence boundaries. I ran some tests with the snippet provided below. When the text has no sentence boundaries, processing 5000 characters takes over 7 minutes, and the runtime gets worse as the text length increases.
If some sentence boundaries are contained, then everything works within milliseconds.
Expected behavior
Sentence splitting should also be fast for texts without sentence boundaries.
Actual behavior
Runtimes of over 7 minutes for 5000 characters.
Code snippet
Additional context