rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Parallel CountVectorizer #55

Closed (rth closed this 5 years ago)

rth commented 5 years ago

Follow up on #20

This adds parallel token counting in CountVectorizer using rayon. Only the two-step ingestion is parallelized, i.e.,

  1. In a first pass, extract vocabulary from the corpus
     vect = CountVectorizer().fit(data)
  2. In a second pass, extract tokens from the corpus given an existing vocabulary
     X = vect.transform(data)

The one-pass ingestion currently done with CountVectorizer.fit_transform is still single-threaded. There the problem is more difficult, as the vocabulary needs to be shared between threads while it is being constructed and tokens are extracted. The two-pass ingestion will typically be faster starting from 4 CPU cores (cf. benchmarks below), but its limitation is that it requires two passes over the corpus.
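A minimal pure-Python sketch (not vtext's actual Rust implementation) of why the second pass parallelizes easily: once fit has built the vocabulary, it is read-only, so each document can be counted independently with no shared mutable state. The function names and thread-pool use here are illustrative only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fit(corpus):
    # First pass: build the vocabulary sequentially over the whole corpus.
    return sorted({tok for doc in corpus for tok in doc.split()})

def transform(corpus, vocabulary, n_jobs=4):
    # Second pass: count tokens per document against the fixed vocabulary.
    def count_one(doc):
        counts = Counter(doc.split())
        return [counts.get(tok, 0) for tok in vocabulary]

    # Embarrassingly parallel: documents never touch shared mutable state.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(count_one, corpus))

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = fit(corpus)
X = transform(corpus, vocab)
# vocab == ['cat', 'dog', 'ran', 'sat', 'the']
# X == [[1, 0, 0, 1, 1], [0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```

In the one-pass case this decomposition breaks down: each worker would need to grow the vocabulary while counting, forcing synchronization on the shared map.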

A known bug is that for n_jobs > 1, all CPU cores are used irrespective of the n_jobs value.

In the future, we should optionally allow two-pass ingestion for fit_transform as well.
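A two-pass fit_transform could simply chain the existing parallel fit and transform, trading an extra pass over the corpus for parallelism. A toy sketch with a hypothetical stand-in class (not the vtext API):

```python
from collections import Counter

class TwoPassCounter:
    """Toy CountVectorizer-like object illustrating two-pass fit_transform."""

    def fit(self, corpus):
        # First pass: collect the vocabulary.
        self.vocabulary_ = sorted({tok for doc in corpus for tok in doc.split()})
        return self

    def transform(self, corpus):
        # Second pass: count tokens against the now-fixed vocabulary
        # (this is the step that can run in parallel per document).
        rows = []
        for doc in corpus:
            counts = Counter(doc.split())
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows

    def fit_transform(self, corpus):
        # Two-pass fit_transform: just chain the two steps above.
        return self.fit(corpus).transform(corpus)

X = TwoPassCounter().fit_transform(["a b a", "b c"])
# X == [[2, 1, 0], [0, 1, 1]]  with vocabulary ['a', 'b', 'c']
```

The cost is reading the corpus twice, which is why the benchmarks below only show a win once enough cores are available.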

Benchmarks

benchmarks/bench_vectorizers.py 
# vectorizing 19924 documents:
     HashingVectorizer(n_jobs=1).transform [vtext]: 1.05s [86.7 MB/s], shape=(19924, 1048576), nnz=3961670
     HashingVectorizer(n_jobs=4).transform [vtext]: 0.30s [304.7 MB/s], shape=(19924, 1048576), nnz=3961670
      HashingVectorizer().transform [scikit-learn]: 5.24s [17.4 MB/s], shape=(19924, 1048576), nnz=4177915
             CountVectorizer(n_jobs=1).fit [vtext]: 0.88s [103.7 MB/s], shape=None, nnz=None
             CountVectorizer(n_jobs=4).fit [vtext]: 0.40s [226.8 MB/s], shape=None, nnz=None
       CountVectorizer(n_jobs=1).transform [vtext]: 1.10s [82.5 MB/s], shape=(19924, 208706), nnz=3962338
       CountVectorizer(n_jobs=4).transform [vtext]: 0.32s [287.1 MB/s], shape=(19924, 208706), nnz=3962338
           CountVectorizer().fit_transform [vtext]: 1.31s [69.5 MB/s], shape=(19924, 208706), nnz=3962338
    CountVectorizer().fit_transform [scikit-learn]: 6.24s [14.6 MB/s], shape=(19924, 208706), nnz=3962338

(In scikit-learn, fit, transform, and fit_transform take comparable time for CountVectorizer.)