rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Return iterator in tokenizers #19

Closed rth closed 5 years ago

rth commented 5 years ago

Return an iterator from the tokenizers, which improves performance slightly: in the benchmarks below, the vtext tokenizers take roughly 8 to 9% less time.
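As a rough illustration of the idea, here is a Python analogy rather than the Rust code changed in this PR: a tokenizer can either build the whole token list up front or hand back an iterator that produces tokens on demand. The function names `tokenize_list` and `tokenize_iter` are hypothetical and are not part of the vtext API; only the regular expression comes from the benchmark.

```python
import re
from typing import Iterator, List

# Same pattern as in the benchmark: words of two or more characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize_list(text: str) -> List[str]:
    """Eager variant: builds the full token list before returning."""
    return TOKEN_PATTERN.findall(text)


def tokenize_iter(text: str) -> Iterator[str]:
    """Lazy variant: yields each token as soon as it is matched."""
    for match in TOKEN_PATTERN.finditer(text):
        yield match.group(0)


text = "Return an iterator in tokenizers"
tokens = tokenize_iter(text)
first = next(tokens)   # only the first token has been computed so far
rest = list(tokens)    # the caller decides when to materialize the rest
```

With the lazy variant, the caller chooses whether and when to collect the tokens, so no intermediate list is built inside the tokenizer itself.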

Results from the bench_tokenizers.py script on master:

# Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.66s [34.2 MB/s]
                RegexpTokenizer(r'\b\w\w+\b'): 2.12s [42.9 MB/s]
   UnicodeSegmentTokenizer(word_bounds=False): 3.37s [27.0 MB/s]
    UnicodeSegmentTokenizer(word_bounds=True): 4.04s [22.5 MB/s]

With this PR:

# Tokenizing 19924 documents
         Python re.findall(r'\b\w\w+\b', ...): 2.57s [35.4 MB/s]
                RegexpTokenizer(r'\b\w\w+\b'): 1.93s [47.0 MB/s]
   UnicodeSegmentTokenizer(word_bounds=False): 3.11s [29.3 MB/s]
    UnicodeSegmentTokenizer(word_bounds=True): 3.72s [24.5 MB/s]
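To make the comparison easier to read, the relative change can be computed from the timings reported above. This is a small sketch: the dictionary names and the `time_reduction` helper are made up for illustration, and the numbers are copied from the two runs.

```python
# Timings in seconds, copied from the two benchmark runs above.
master = {
    "Python re.findall": 2.66,
    "RegexpTokenizer": 2.12,
    "UnicodeSegmentTokenizer(word_bounds=False)": 3.37,
    "UnicodeSegmentTokenizer(word_bounds=True)": 4.04,
}
this_pr = {
    "Python re.findall": 2.57,
    "RegexpTokenizer": 1.93,
    "UnicodeSegmentTokenizer(word_bounds=False)": 3.11,
    "UnicodeSegmentTokenizer(word_bounds=True)": 3.72,
}


def time_reduction(before: float, after: float) -> float:
    """Percentage of wall time saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: round(time_reduction(master[name], this_pr[name]), 1)
    for name in master
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% less time")
```

The three vtext tokenizers come out between about 7.7% and 9.0% faster. The `re.findall` baseline does not go through vtext, so its change of about 3.4% presumably reflects run-to-run variation rather than this PR.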