rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Sentence tokenizers benchmarks #71

Closed rth closed 4 years ago

rth commented 4 years ago

Add benchmarks for sentence tokenizers, including https://github.com/rth/vtext/pull/70 . The current output is:

$ python benchmarks/bench_sentence_tokenizers.py 
# Tokenizing 19924 documents
           Python re.split('(?<=[!.?])', ...): 0.74s 123.6 MB/s, 1589555 sentences
                   UnicodeSentenceTokenizer(): 2.16s 42.1 MB/s, 1396894 sentences
                       PunctuationTokenizer(): 0.50s 182.2 MB/s, 1585005 sentences

See benchmarks/README.md for downloading the 20 newsgroups dataset used in benchmarks.
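For reference, the pure-Python baseline in the first row is just `re.split` with a zero-width lookbehind that splits after every `!`, `.`, or `?`. A minimal sketch (the function name and sample text are illustrative, not from the benchmark script; zero-width splits require Python 3.7+). Note that such a regex splits inside abbreviations like "U.S.", which likely contributes to its higher sentence count compared to `UnicodeSentenceTokenizer()`:

```python
import re


def naive_sentences(text):
    # Split after each terminal punctuation mark using a zero-width
    # lookbehind, then drop any empty trailing fragment.
    return [s for s in re.split(r"(?<=[!.?])", text) if s]


print(naive_sentences("Hello world. How are you? Fine!"))
# → ['Hello world.', ' How are you?', ' Fine!']
```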

Will merge on green CI to make working on https://github.com/rth/vtext/pull/70 easier. cc @joshlk

Edit: an earlier version of the above results was mistakenly produced with a debug build.