linas closed this issue 8 years ago
The default Java class is claimed to work correctly; e.g. https://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html says:
"Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses."
Never mind. Oracle is lying about Java. The sentence "Dr. Smith is late." is split into two sentences by java.text.BreakIterator; it thinks that "Dr." is a valid sentence.
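For anyone who wants to check this themselves, here is a minimal sketch using only the JDK. The class name SentenceSplitDemo and the helper split are made up for illustration, and Locale.US is an assumption; the results may differ by locale or JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of any project code.
final class SentenceSplitDemo {

    // Return the spans that BreakIterator reports as sentences.
    // Locale.US is an assumption made for this sketch.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) pair is one span.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Print each detected span in brackets so trailing spaces are visible.
        for (String s : split("Dr. Smith is late.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

If more than one bracketed span is printed for the single test sentence, the splitter is treating the abbreviation as a sentence end, which is the behavior described above.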
The OpenNLP toolkit is used for only one thing: sentence splitting (detecting the boundaries of sentences). It is slightly more accurate than the default java.text.BreakIterator, but perhaps not enough of an improvement to be worth the extra effort? It's a hassle to install OpenNLP, and since the usage is so marginal, I'm thinking it's just not worth the effort.