vanderlee / php-sentence

Simple text sentence splitting and counting. Supports at least English, German and Dutch, possibly more. If you find it works well enough for your language, please let me know!
MIT License

Splitting on abbreviations should connect to next, not previous #2

Closed by marktaw 7 years ago

marktaw commented 7 years ago

Thank you for this code.

I've changed the logic in abbreviationMerge so that capitalized abbreviations are merged into the next fragment instead of the previous one.

E.g.

Last week, former director of the F.B.I. James B. Comey was fired. Mr. Comey was not available for comment.

Now splits neatly into

[0] => Last week, former director of the F.B.I. James B. Comey was fired.
[1] =>  Mr. Comey was not available for comment.

Where previously it split into

[0] => Last week, former director of the F.B.I.
[1] =>  James B.
[2] =>  Comey was fired. Mr.
[3] =>  Comey was not available for comment.

My revisions are in

https://github.com/vanderlee/php-sentence/compare/master...marktaw:master
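For readers who don't want to open the diff, the idea can be sketched roughly as follows. This is only an illustration, not the code from the linked branch: the function name `mergeAbbreviationsForward`, the short list of titles, and the naive pre-split on punctuation followed by whitespace are all assumptions made for the example. A fragment whose last word looks like initials (`F.B.I.`, `B.`) or a known title (`Mr.`) is held back and joined to the fragment that follows it.

```php
<?php

/**
 * Hypothetical helper, NOT the library's abbreviationMerge().
 * Joins any fragment that ends in an abbreviation to the NEXT fragment.
 *
 * @param string[] $fragments Text already split after ., ! or ?
 * @param string[] $titles    Assumed list of known abbreviations
 * @return string[]
 */
function mergeAbbreviationsForward(array $fragments, array $titles = ['Mr.', 'Mrs.', 'Ms.', 'Dr.']): array
{
    $sentences = [];
    $buffer = '';

    foreach ($fragments as $fragment) {
        // Append the fragment to whatever is being held back.
        $buffer = $buffer === '' ? $fragment : $buffer . ' ' . $fragment;

        // Last whitespace-delimited word of the text collected so far.
        $words = preg_split('/\s+/', trim($buffer));
        $last = end($words);

        // Initials such as "B." or "F.B.I.": one or more capital-plus-period pairs.
        $isInitials = preg_match('/^(?:[A-Z]\.)+$/', $last) === 1;
        $isTitle = in_array($last, $titles, true);

        // Only emit a sentence when it does not end in an abbreviation.
        if (!$isInitials && !$isTitle) {
            $sentences[] = $buffer;
            $buffer = '';
        }
    }

    // Keep trailing text that ended in an abbreviation.
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }

    return $sentences;
}

$text = 'Last week, former director of the F.B.I. James B. Comey was fired. '
      . 'Mr. Comey was not available for comment.';

// Naive pre-split: break after ., ! or ? when followed by whitespace.
$fragments = preg_split('/(?<=[.!?])\s+/', $text);

print_r(mergeAbbreviationsForward($fragments));
```

On the example sentence this yields the two expected sentences, although unlike the library output above it does not preserve the leading whitespace. A sentence that genuinely ends in an abbreviation ("He joined the F.B.I.") would be merged with the following one, so the heuristic trades one kind of error for another.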

marktaw commented 7 years ago

Update - I've also added functions to attach closing quotes to the previous fragment, and a function to turn unicode quotes into straight quotes. Obviously this could be generalized to handle unicode quotes directly, but I need straight quotes rather than unicode quotes anyway, so...
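As a rough illustration of the quote conversion (not the function from the linked branch; the name `straightenQuotes` and the exact set of characters covered are assumptions), a simple lookup table passed to `strtr()` is enough to map the common curly quotes to their straight equivalents:

```php
<?php

/**
 * Hypothetical helper: replaces common unicode (curly) quotes with
 * straight ASCII quotes. The set of characters covered is an assumption.
 */
function straightenQuotes(string $text): string
{
    return strtr($text, [
        "\u{201C}" => '"',  // left double quotation mark
        "\u{201D}" => '"',  // right double quotation mark
        "\u{201E}" => '"',  // double low-9 quotation mark
        "\u{2018}" => "'",  // left single quotation mark
        "\u{2019}" => "'",  // right single quotation mark
        "\u{201A}" => "'",  // single low-9 quotation mark
    ]);
}

echo straightenQuotes("\u{201C}It\u{2019}s done,\u{201D} she said."), PHP_EOL;
```

The `"\u{...}"` escapes require PHP 7 or later and assume the input is UTF-8. Note that the conversion is lossy: once quotes are straightened, the original characters cannot be recovered from the output.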

vanderlee commented 7 years ago

Thanks for the fix. This greatly improves the accuracy.

I've added test cases (based on your examples, but anonymized) for this fix and also for the cleanupUnicode addition. Additionally, this fix allowed me to reinstate some "incomplete" tests, so again, thanks!

I may take a look at cleanupUnicode in the future, as split() currently returns the cleaned-up version rather than the text with its original quotes. However, split() is really an afterthought for this project; count() is the one that needs to be correct.

It's now available from Composer as v1.0.3.