haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Very long runtimes of SimpleSentenceSplitter when splitting long texts without sentence separators (messy text) #708

Closed sven-h closed 2 years ago

sven-h commented 2 years ago
Describe the bug

The SimpleSentenceSplitter performs very poorly on messy text that contains no sentence boundaries. I ran some tests with the snippet provided below:

| text length (characters) | runtime (seconds) |
|---|---|
| 1000 | 3 |
| 2000 | 28 |
| 3000 | 95 |
| 4000 | 215 |
| 5000 | 422 |

For text without sentence boundaries, splitting 5000 characters takes over 7 minutes, and the runtime gets disproportionately worse as the text length increases.
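To put a number on how fast the runtime grows, one can fit a straight line to log(runtime) against log(text length); the slope is the exponent k in runtime ≈ c · length^k. The following is only a sketch of that arithmetic. The class and method names (`GrowthEstimate`, `exponent`) are made up for this illustration and are not part of Smile.

```java
// Hypothetical helper, not part of Smile: estimates the growth exponent k
// in runtime ≈ c * length^k from measured (length, runtime) pairs.
final class GrowthEstimate {
    private GrowthEstimate() {}

    /** Least-squares slope of log(runtime) against log(length). */
    static double exponent(double[] lengths, double[] runtimes) {
        int n = lengths.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log(lengths[i]);
            double y = Math.log(runtimes[i]);
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        // Standard closed-form slope of a simple linear regression.
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }
}
```

Called with the lengths 1000 to 5000 and the runtimes 3, 28, 95, 215 and 422 seconds from the measurements above, this gives an exponent of about 3, so the runtime grows roughly cubically with the text length over this range.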

If the text contains some sentence boundaries, splitting finishes within milliseconds.

Expected behavior

Sentence splitting should also be fast for text without sentence boundaries.

Actual behavior

Runtimes of over 7 minutes for 5000 characters.

Code snippet

import java.util.Random;
import smile.nlp.tokenizer.SimpleSentenceSplitter;

// Alphabet without punctuation or whitespace, so the generated text has no sentence boundaries.
String possibleCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
                            "abcdefghijklmnopqrstuvwxyz" +
                            "0123456789"; // +
                            // ",;.!? "; // TODO: add this line and everything works fast
Random random = new Random(1234); // fixed seed so the text is reproducible
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 30000; i++) {
    sb.append(possibleCharacters.charAt(random.nextInt(possibleCharacters.length())));
}
String text = sb.toString();

// Time the splitter on prefixes of increasing length.
System.out.println("| text length | runtime (milliseconds) |");
for (int i = 1000; i < text.length(); i += 1000) {
    System.out.print("| " + i + " | ");
    long start = System.currentTimeMillis();
    String[] sentences = SimpleSentenceSplitter.getInstance().split(text.substring(0, i));
    long diff = System.currentTimeMillis() - start;
    System.out.println(diff + " |");
}

haifengl commented 2 years ago

Thanks. SimpleSentenceSplitter relies on regular expressions, which run slowly on very long strings. The contract of this API assumes that the input is normal English text, and your use case doesn't fit. I suggest adding a safeguard check before calling this API. Although we could do such a check internally, it would add unnecessary overhead for most other users.
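One possible shape for such a safeguard check, as a minimal sketch: scan the text once and skip the splitter if any run of characters between sentence terminators is longer than a chosen limit. Everything here is an assumption made for illustration. `SplitGuard`, `hasRegularBoundaries`, `splitSafely` and the `maxRun` limit are hypothetical names, the terminator set `.`, `!`, `?` is a simplification, and returning the whole text as one "sentence" is just one fallback among several. The splitter is passed in as a function so the sketch does not depend on Smile.

```java
import java.util.function.Function;

// Hypothetical guard, not part of Smile: avoids handing boundary-free text to a slow splitter.
final class SplitGuard {
    private SplitGuard() {}

    /** Returns false if some run of more than maxRun characters contains no '.', '!' or '?'. */
    static boolean hasRegularBoundaries(String text, int maxRun) {
        int run = 0; // characters seen since the last terminator
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                run = 0;
            } else if (++run > maxRun) {
                return false;
            }
        }
        return true;
    }

    /** Applies the splitter only when the text passes the check; otherwise returns the text unsplit. */
    static String[] splitSafely(String text, int maxRun, Function<String, String[]> splitter) {
        if (!hasRegularBoundaries(text, maxRun)) {
            return new String[] { text }; // fallback: treat the whole text as one sentence
        }
        return splitter.apply(text);
    }
}
```

With Smile on the classpath, a call could look like `SplitGuard.splitSafely(text, 1000, SimpleSentenceSplitter.getInstance()::split)`. The limit of 1000 is an arbitrary starting point and would need tuning against real input. The check is a single linear pass, so its cost on normal English text should be small compared with the split itself.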

sven-h commented 2 years ago

Okay, just wanted to let you know.