This adds parallel token counting in CountVectorizer using rayon. Only the two-step ingestion is parallelized, i.e.:

1. In a first pass, extract the vocabulary from the corpus: `vect = CountVectorizer().fit(data)`
2. In a second pass, extract tokens from the corpus given the existing vocabulary: `X = vect.transform(data)`
The one-pass ingestion currently done with CountVectorizer.fit_transform is still single-threaded. There the problem is harder, as the vocabulary needs to be shared between threads while it is being constructed and tokens are extracted. The two-pass ingestion will typically be faster starting from 4 CPU cores (cf. benchmarks below), but its limitations are that it requires:

- loading the data twice (or alternatively keeping it all in memory)
- loading all the processed tokens into memory when n_jobs > 1 is used
A known bug is that for n_jobs > 1, all CPU cores will be used irrespective of the n_jobs value.
In the future, we should optionally allow two-pass ingestion for fit_transform as well.
Follow-up to #20.
Benchmarks

(In scikit-learn, fit, transform, and fit_transform take a comparable time for CountVectorizer.)