rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Sentence tokenizers benchmarks #71

Closed rth closed 4 years ago

rth commented 4 years ago

Add benchmarks for sentence tokenizers, including https://github.com/rth/vtext/pull/70 . The current output is:

$ python benchmarks/bench_sentence_tokenizers.py 
# Tokenizing 19924 documents
           Python re.split('(?<=[!.?])', ...): 0.74s 123.6 MB/s, 1589555 sentences
                   UnicodeSentenceTokenizer(): 2.16s 42.1 MB/s, 1396894 sentences
                       PunctuationTokenizer(): 0.50s 182.2 MB/s, 1585005 sentences

See benchmarks/README.md for downloading the 20 newsgroups dataset used in benchmarks.
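For reference, the pure-Python baseline in the first row is just `re.split` with a zero-width lookbehind that splits after every `!`, `.`, or `?`. A minimal sketch (the function name and sample text are illustrative, not from the benchmark script; zero-width splits require Python 3.7+). Note that such a regex splits inside abbreviations like "U.S.", which likely contributes to its higher sentence count compared to `UnicodeSentenceTokenizer()`:

```python
import re


def naive_sentences(text):
    # Split after each terminal punctuation mark using a zero-width
    # lookbehind, then drop any empty trailing fragment.
    return [s for s in re.split(r"(?<=[!.?])", text) if s]


print(naive_sentences("Hello world. How are you? Fine!"))
# → ['Hello world.', ' How are you?', ' Fine!']
```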

Will merge on green CI to make working on https://github.com/rth/vtext/pull/70 easier. cc @joshlk

Edit: an earlier version of the above results was mistakenly produced with a debug build.