rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Parallel HashingVectorizer #20

Closed rth closed 5 years ago

rth commented 5 years ago

This is a first implementation of the parallel token counting using Rayon.

So far there are two issues.

Benchmarks with 2 CPU cores.

master:

# vectorizing 19924 documents:
      HashingVectorizer (text-vectorize): 1.29s [70.3 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.69s [33.8 MB/s], shape=(19924, 208706), nnz=3962338

This PR (run on a 2-core CPU, so RAYON_NUM_THREADS=4 corresponds to hyperthreading):

$ RAYON_NUM_THREADS=1 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 2.11s [43.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 3.22s [28.3 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=2 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.42s [63.9 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.59s [35.2 MB/s], shape=(19924, 208706), nnz=3962338
$ RAYON_NUM_THREADS=4 python3.7 ../benchmarks/bench_vectorizers.py 
      HashingVectorizer (text-vectorize): 1.33s [68.2 MB/s], shape=(19924, 1048576), nnz=3961731
        CountVectorizer (text-vectorize): 2.48s [36.6 MB/s], shape=(19924, 208706), nnz=3962338

So the parallel scaling doesn't look too bad; the real issue is that the single-threaded implementation in this PR is much slower than on master.

rth commented 5 years ago

In the end, this PR contains only the parallel version of the HashingVectorizer. CountVectorizer could be parallelized in a follow-up PR; the situation there is more complicated, as it is not stateless and the vocabulary needs to be shared across threads.

For HashingVectorizer, the scaling is reasonably good up to 8-16 CPU cores; beyond that we seem to reach the strong-scaling limit, at least for this dataset. The maximum speed-up obtained is 5x over the single-threaded version. For n_jobs=1 we fall back to the non-parallelized implementation:

# vectorizing 19924 documents:
     HashingVectorizer (scikit-learn): 4.75s [19.2 MB/s], shape=(19924, 1048576), nnz=4177915
     HashingVectorizer (vtext, n_jobs=1): 1.16s [78.2 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=2): 0.66s [137.4 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=4): 0.40s [227.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=8): 0.27s [340.0 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=16): 0.23s [400.7 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer (vtext, n_jobs=32): 0.23s [394.6 MB/s], shape=(19924, 1048576), nnz=3961670

Tested on an EC2 c4.8xlarge with 36 CPU cores, loading files from a tmpfs to avoid disk I/O limitations.

The only limitation is that currently n_jobs>1 uses all available CPU cores instead of the requested number. The count can be adjusted at start time with the RAYON_NUM_THREADS environment variable. Fixing this properly would require using a local Rayon thread pool instead of the global one; I have not yet found how to do that with the current pipeline definition.