vanderlee / php-sentence

Simple text sentence splitting and counting. Supports at least English, German and Dutch, possibly more. If you find it works well enough for your language, please let me know!
MIT License

Acronyms at the end of sentence are incorrectly parsed #13

Status: Open (dinamic opened this issue 4 years ago)

dinamic commented 4 years ago

The library has been really useful to us for breaking text into sentences, but I've noticed one issue so far. If a sentence ends with an acronym and it is the last sentence in the text, everything is okay. If another sentence follows it, the result is incorrect: with a lowercase acronym the text is split inside the acronym, and with a capitalized acronym it is not split at all.

Here it works fine:

$sentences = $sentenceBreaker->split('Let\'s meet at 10:00 a.m..', \Sentence::SPLIT_TRIM);

var_dump($sentences);
array(1) {
  [0] =>
  string(25) "Let's meet at 10:00 a.m.."
}

But it fails here, splitting inside the acronym instead of after it:

$sentences = $sentenceBreaker->split('Let\'s meet at 10:00 a.m.. How about Greg?', \Sentence::SPLIT_TRIM);

var_dump($sentences);
array(2) {
  [0] =>
  string(22) "Let's meet at 10:00 a."
  [1] =>
  string(19) "m.. How about Greg?"
}

With a capitalized acronym it fails differently, returning the whole text as a single sentence:

$sentences = $sentenceBreaker->split('Let\'s meet at 10:00 A.M.. How about Greg?', \Sentence::SPLIT_TRIM);

var_dump($sentences);
array(1) {
  [0] =>
  string(41) "Let's meet at 10:00 A.M.. How about Greg?"
}
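
The three cases above can be bundled into one reproduction script. This is only a rough sketch: `checkSplits` is a hypothetical helper (not part of php-sentence), the expected arrays are an assumption about what the correct output should be, and the library part assumes the global `\Sentence` class from the snippets above is loadable (for example through Composer's autoloader).

```php
<?php
// Hypothetical helper, not part of php-sentence: runs a splitter callable
// over each input text and records whether the result matches the expectation.
function checkSplits(callable $split, array $cases): array
{
    $report = [];
    foreach ($cases as $text => $expected) {
        $actual = $split($text);
        $report[$text] = [
            'ok'     => $actual === $expected,
            'actual' => $actual,
        ];
    }
    return $report;
}

// Assumed expectations: a split after "a.m.." / "A.M..", never inside it.
$cases = [
    "Let's meet at 10:00 a.m.." => ["Let's meet at 10:00 a.m.."],
    "Let's meet at 10:00 a.m.. How about Greg?" => ["Let's meet at 10:00 a.m..", "How about Greg?"],
    "Let's meet at 10:00 A.M.. How about Greg?" => ["Let's meet at 10:00 A.M..", "How about Greg?"],
];

// Runs only when the library is loaded; assumes the global \Sentence class
// and the SPLIT_TRIM flag used in the snippets above.
if (class_exists('Sentence')) {
    $sentenceBreaker = new \Sentence();
    $split = function (string $text) use ($sentenceBreaker): array {
        return $sentenceBreaker->split($text, \Sentence::SPLIT_TRIM);
    };
    foreach (checkSplits($split, $cases) as $text => $result) {
        echo ($result['ok'] ? 'PASS' : 'FAIL'), ': ', $text, PHP_EOL;
    }
}
```

Given the output shown above, the first case would be expected to pass and the other two to fail until the acronym handling is fixed.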

dinamic commented 4 years ago

@vanderlee would you be able to have a look into this one, please?