Very long runtimes of SimpleSentenceSplitter when splitting long texts without sentence separators (messy text)

sven-h commented 2 years ago

Describe the bug

The `SimpleSentenceSplitter` performs very poorly on messy text that contains no sentence boundaries. I ran some tests with the snippet provided below:

| text length (characters) | runtime (seconds) |
| --- | --- |
| 1000 | 3 |
| 2000 | 28 |
| 3000 | 95 |
| 4000 | 215 |
| 5000 | 422 |

When the text contains no sentence boundaries, processing 5000 characters takes over 7 minutes, and the runtime grows much faster than linearly as the text gets longer.
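To put a rough number on that growth, the measurements reported above can be fitted to a power law, runtime ≈ c · n^k, by comparing pairs of data points. The sketch below is only a back-of-the-envelope estimate: the class and method names are made up for illustration, and five data points from one machine are a small sample.

```java
/** Illustrative helper (not part of Smile): estimates k in runtime ≈ c * n^k. */
final class GrowthEstimate {

    /** Exponent implied by two (length, runtime) measurements. */
    static double exponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Text lengths and runtimes (seconds) as reported in this issue.
        int[] lengths = {1000, 2000, 3000, 4000, 5000};
        double[] seconds = {3, 28, 95, 215, 422};

        // Exponent between each pair of consecutive measurements.
        for (int i = 1; i < lengths.length; i++) {
            double k = exponent(lengths[i - 1], seconds[i - 1], lengths[i], seconds[i]);
            System.out.printf("%d -> %d: k = %.2f%n", lengths[i - 1], lengths[i], k);
        }

        // Exponent over the whole measured range.
        double overall = exponent(lengths[0], seconds[0], lengths[4], seconds[4]);
        System.out.printf("overall: k = %.2f%n", overall);
    }
}
```

For these numbers the estimated exponent comes out close to 3 for every pair, which suggests roughly cubic growth in the text length on this kind of input.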

If the text contains some sentence boundaries, everything finishes within milliseconds.

Expected behavior

Sentence splitting should also be fast for texts without sentence boundaries.

Actual behavior

Runtimes of over 7 minutes for 5000 characters.

Code snippet

import java.util.Random;
import smile.nlp.tokenizer.SimpleSentenceSplitter;

// Alphabet without punctuation or whitespace, so the text has no sentence boundaries.
String possibleCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
                                    "abcdefghijklmnopqrstuvxyz"+
                                    "0123456789";// +
                                    //",;.!? "; //TODO: add this line and everything works fast
Random random = new Random(1234); // fixed seed so the test text is reproducible
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 30000; i++) {
    sb.append(possibleCharacters.charAt(random.nextInt(possibleCharacters.length())));
}
String text = sb.toString();
// Time the splitter on prefixes of increasing length (1000, 2000, ...).
System.out.println("| text length | runtime (milliseconds) |");
for (int i = 1000; i < text.length(); i += 1000) {
    System.out.print("| " + i + " | ");
    long start = System.currentTimeMillis();
    String[] sentences = SimpleSentenceSplitter.getInstance().split(text.substring(0, i));
    long diff = System.currentTimeMillis() - start;
    System.out.println(diff + " |");
}

Additional context

What Java (OpenJDK, Oracle JDK, etc.) are you using and which Java version: Oracle 1.8
Which Smile version: 2.6.0
What is your build system (e.g. Ubuntu, MacOS, Windows, Debian): Windows

haifengl commented 2 years ago

Thanks. SimpleSentenceSplitter relies on regex, which runs slowly on very long strings. The contract of this API assumes that the input is normal English text, so your use case doesn't fit. I suggest that you add a safeguard check before calling this API. We could add such a check internally, but it would cause unnecessary overhead for most other users.
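One possible form of the safeguard check suggested above is to scan the input for long stretches that contain neither whitespace nor the punctuation characters from the issue's snippet (`,;.!?`), and to skip the splitter when such a stretch is found. This is a minimal sketch under assumptions: `SplitGuard`, `hasLongUnbrokenRun`, and the threshold are hypothetical names and values, not part of Smile, and whether an unbroken run is exactly what triggers the slow path has not been verified against the library's regular expressions.

```java
/** Hypothetical pre-check (not part of Smile) for text that may be slow to split. */
final class SplitGuard {

    // Characters treated as breaking up a run; taken from the issue's snippet.
    private static final String BREAK_CHARS = ",;.!?";

    /**
     * Returns true if the text contains more than maxRun consecutive characters
     * that are neither whitespace nor one of BREAK_CHARS.
     */
    static boolean hasLongUnbrokenRun(String text, int maxRun) {
        int run = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || BREAK_CHARS.indexOf(c) >= 0) {
                run = 0; // a separator ends the current run
            } else if (++run > maxRun) {
                return true; // stop early, no need to scan the rest
            }
        }
        return false;
    }
}
```

A caller could then treat the whole input as a single sentence when `SplitGuard.hasLongUnbrokenRun(text, 200)` returns true, and call `SimpleSentenceSplitter.getInstance().split(text)` otherwise. The threshold of 200 is an arbitrary illustration and would need tuning for real data. The scan is a single linear pass, so its cost is small compared with the runtimes reported in this issue.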

sven-h commented 2 years ago

Okay, just wanted to let you know.

haifengl / smile #708