rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Parallel HashingVectorizer #20

Closed rth closed 5 years ago

rth commented 5 years ago

This is a first implementation of the parallel token counting using Rayon.

So far there are two issues.

Benchmarks with 2 CPU cores.

master:

# vectorizing 19924 documents:
      HashingVectorizer (text-vectorize): 1.29s [70.3 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.69s [33.8 MB/s], shape=(19924, 208706), nnz=3962338

This PR (run on a 2-core CPU, so RAYON_NUM_THREADS=4 corresponds to hyperthreading):

$ RAYON_NUM_THREADS=1 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 2.11s [43.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 3.22s [28.3 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=2 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.42s [63.9 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.59s [35.2 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=4 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.33s [68.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.48s [36.6 MB/s], shape=(19924, 208706), nnz=3962338

So the parallel scaling doesn't look too bad; the real issue is that the single-threaded implementation in this PR is much slower than on master.

rth commented 5 years ago

In the end, this PR contains only the parallel version of the HashingVectorizer. CountVectorizer could be parallelized in a follow-up PR; the situation there is more complicated, as it is not stateless and the vocabulary needs to be shared across threads.

For HashingVectorizer, the scaling is reasonably good up to 8-16 CPU cores; beyond that we seem to reach the strong-scaling limit, at least for this dataset. The maximum speed-up obtained is 5x over the single-threaded version. For n_jobs=1 we fall back to the non-parallelized implementation:

# vectorizing 19924 documents:
     HashingVectorizer (scikit-learn): 4.75s [19.2 MB/s], shape=(19924, 1048576), nnz=4177915
     HashingVectorizer (vtext, n_jobs=1): 1.16s [78.2 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=2): 0.66s [137.4 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=4): 0.40s [227.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=8): 0.27s [340.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=16): 0.23s [400.7 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=32): 0.23s [394.6 MB/s], shape=(19924, 1048576), nnz=3961670

Tested on an EC2 c4.8xlarge with 36 CPU cores, loading files from a tmpfs to avoid disk I/O limitations.

The only limitation is that currently n_jobs>1 uses all available CPU cores instead of the requested number. The count can be adjusted at start time with the RAYON_NUM_THREADS environment variable. Fixing this properly would require using a local Rayon thread pool instead of the global one; I have not yet found how to do that with the current pipeline definition.