Closed · rth closed this 4 years ago
Add benchmarks for sentence tokenizers, including https://github.com/rth/vtext/pull/70. The current output is:
```
$ python benchmarks/bench_sentence_tokenizers.py

# Tokenizing 19924 documents
Python re.split('(?<=[!.?])', ...): 0.74s 123.6 MB/s, 1589555 sentences
UnicodeSentenceTokenizer():         2.16s  42.1 MB/s, 1396894 sentences
PunctuationTokenizer():             0.50s 182.2 MB/s, 1585005 sentences
```
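For reference, the `re.split` baseline in the first row splits on a zero-width lookbehind so the punctuation stays attached to the preceding sentence. A minimal sketch (the sample text is illustrative, not taken from the benchmark corpus):

```python
import re

text = "Hello world! How are you? Fine."

# Split at the zero-width position just after sentence-ending
# punctuation; the lookbehind keeps '!', '.', '?' in the sentence.
# Zero-width splits require Python 3.7+.
parts = re.split(r"(?<=[!.?])", text)

# Drop empty fragments and surrounding whitespace.
sentences = [s.strip() for s in parts if s.strip()]
# → ['Hello world!', 'How are you?', 'Fine.']
```

This baseline has no understanding of abbreviations or decimals (e.g. "Dr." or "3.14" would be split), which is part of why its sentence count differs from the Unicode-aware tokenizer above.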
See `benchmarks/README.md` for downloading the 20 newsgroups dataset used in the benchmarks.
Will merge on green CI to make working on https://github.com/rth/vtext/pull/70 easier. cc @joshlk
Edit: an earlier version of the above results used a debug build by mistake.