This adds parallel token counting in CountVectorizer using rayon. Only the two-step ingestion is parallelized, i.e.:

1. In a first pass, extract the vocabulary from the corpus: `vect = CountVectorizer().fit(data)`
2. In a second pass, extract tokens from the corpus given the existing vocabulary: `X = vect.transform(data)`
The one-pass ingestion currently done with CountVectorizer.fit_transform is still single-threaded. There the problem is harder, as the vocabulary needs to be shared between threads while it is being constructed and tokens are extracted. The two-pass ingestion will typically be faster starting from 4 CPU cores (cf. benchmarks below), but its limitations are that it requires:

- loading the data twice (or alternatively keeping it all in memory)
- loading all the processed tokens into memory when n_jobs > 1 is used
A known bug is that for n_jobs > 1, all CPU cores will be used irrespective of the n_jobs value.
In the future, we should optionally allow two-pass ingestion for fit_transform as well.
Follow-up to #20.
Benchmarks

(In scikit-learn, fit, transform, and fit_transform take a comparable time for CountVectorizer.)