rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Snowball stemmer #22

Closed by rth 5 years ago

rth commented 5 years ago

This adds a Python wrapper around the rust_stemmers crate, which implements the Snowball stemming algorithms.

The result is around 15-20x faster than NLTK's stemmers in this benchmark:

$ python3.7 ../benchmarks/bench_stemmers.py 
# stemming 1000 documents
                    nltk.stem.PorterStemmer(): 7.18s [0.05 M tokens/s]
         nltk.stem.SnowballStemmer('english'): 5.31s [0.07 M tokens/s]
          nltk.stem.SnowballStemmer('french'): 10.68s [0.04 M tokens/s]
pytext_vectorize.stem.SnowballStemmer('english'): 0.37s [1.05 M tokens/s]
pytext_vectorize.stem.SnowballStemmer('french'): 0.48s [0.82 M tokens/s]
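
For reference, the throughput figures above are the token count divided by wall-clock time. The following is only a minimal sketch of how such a measurement could be made; it is not the contents of `bench_stemmers.py`. The names `bench_stemmer`, `format_result`, and `toy_stem` are hypothetical, and `toy_stem` is a placeholder that does not implement Snowball. Any callable mapping a token to its stem (for example the `stem` method of an NLTK stemmer) could be passed in its place.

```python
import time
from typing import Callable, Iterable, List, Tuple


def bench_stemmer(
    stem: Callable[[str], str], documents: Iterable[List[str]]
) -> Tuple[float, int]:
    """Stem every token of every document; return (elapsed seconds, token count)."""
    n_tokens = 0
    t0 = time.perf_counter()
    for tokens in documents:
        for token in tokens:
            stem(token)
            n_tokens += 1
    elapsed = time.perf_counter() - t0
    return elapsed, n_tokens


def format_result(label: str, elapsed: float, n_tokens: int) -> str:
    """Format one result line in the same style as the output above."""
    rate = n_tokens / elapsed / 1e6  # millions of tokens per second
    return f"{label}: {elapsed:.2f}s [{rate:.2f} M tokens/s]"


def toy_stem(word: str) -> str:
    # Hypothetical placeholder stemmer (NOT Snowball): strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


# 1000 small pre-tokenized documents, purely illustrative data.
docs = [["cats", "running", "dogs"]] * 1000
elapsed, n_tokens = bench_stemmer(toy_stem, docs)
print(format_result("toy_stem", elapsed, n_tokens))
```

Timing only the stemming loop (with tokenization done beforehand) keeps the comparison between stemmers fair, since tokenizer cost would otherwise be added equally to every row.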