nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License
813 stars 84 forks

🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms #68

Closed nipunsadvilkar closed 4 years ago

nipunsadvilkar commented 4 years ago

Segmentation Tools, Libraries and Algorithms:

| Tool | Accuracy | Speed (ms) |
|---|---|---|
| blingfire | 75.00% | 49.91 |
| pySBD | 97.92% | 2449.18 |
| syntok | 68.75% | 783.73 |
| spaCy | 52.08% | 473.96 |
| stanza | 72.92% | 120803.37 |
| NLTK | 56.25% | 342.98 |

nipunsadvilkar commented 4 years ago

@DeNeutoy Can you suggest any dataset to benchmark against?

DeNeutoy commented 4 years ago

@nipunsadvilkar Perhaps a book from Project Gutenberg? They have full plaintext books, e.g. http://www.gutenberg.org/files/1661/1661-0.txt

This would also allow us to analyse failure cases of the various methods.
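
A rough sketch of what such a comparison harness could look like. It assumes each tool is wrapped as a plain callable that takes text and returns a list of sentences, and that "accuracy" means the share of test cases whose predicted split matches the expected split exactly. The names `benchmark`, `naive_regex_segmenter` and `CASES` are made up for illustration, and the two test cases are hypothetical, not taken from any dataset mentioned in this thread.

```python
import re
import time
from typing import Callable, Dict, List, Tuple

# A segmenter is any callable that maps raw text to a list of sentences.
Segmenter = Callable[[str], List[str]]

# Hypothetical labelled cases: (input text, expected sentences).
CASES: List[Tuple[str, List[str]]] = [
    ("Hello world. How are you?", ["Hello world.", "How are you?"]),
    ("Dr. Watson arrived at 5 p.m. yesterday.",
     ["Dr. Watson arrived at 5 p.m. yesterday."]),
]


def naive_regex_segmenter(text: str) -> List[str]:
    """Baseline only: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def benchmark(segmenters: Dict[str, Segmenter],
              cases: List[Tuple[str, List[str]]]) -> Dict[str, dict]:
    """Return accuracy (%), wall-clock time (ms) and failure cases per tool."""
    results: Dict[str, dict] = {}
    for name, segment in segmenters.items():
        failures = []
        start = time.perf_counter()
        for text, expected in cases:
            predicted = [s.strip() for s in segment(text)]
            if predicted != expected:
                # Keep the input and the wrong split for later inspection.
                failures.append((text, predicted))
        elapsed_ms = (time.perf_counter() - start) * 1000
        accuracy = 100 * (len(cases) - len(failures)) / len(cases)
        results[name] = {
            "accuracy": accuracy,
            "ms": elapsed_ms,
            "failures": failures,
        }
    return results


# Other tools can be added to this dict once installed, for example
# pysbd.Segmenter(language="en", clean=False).segment for pySBD.
segmenters: Dict[str, Segmenter] = {"naive_regex": naive_regex_segmenter}

results = benchmark(segmenters, CASES)
for tool, stats in results.items():
    print(f"{tool}: {stats['accuracy']:.2f}% in {stats['ms']:.2f} ms")
    for text, predicted in stats["failures"]:
        print(f"  failed on {text!r}: {predicted}")
```

A Project Gutenberg book has no gold sentence boundaries, so on that kind of input the same loop would mainly give timings. The failure analysis would then come from comparing the tools' outputs against each other and inspecting where they disagree.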

DeNeutoy commented 4 years ago

Here is another algorithm to benchmark: https://github.com/microsoft/BlingFire#python-api-description

Blingfire is very fast, but I don't know how good their sbd module is.

nipunsadvilkar commented 4 years ago

@DeNeutoy: Benchmarked blingfire. Quite amazed by its speed & accuracy 💯

nipunsadvilkar commented 4 years ago

Going with @DeNeutoy's approach #69