erico-imgproj opened 6 months ago
Thanks for the issue @erico-imgproj! I notice that you're using `from sklearn.feature_extraction.text import CountVectorizer`. Have you tried using cuML's CountVectorizer? Have you run into issues with it?
We'll also work on solving the issue you're seeing with scikit-learn's CountVectorizer; it should work as well.
Hi @dantegd, I tested both, but the error is still there. The lines up to the generation of the features come from an example available on the cuML website, so they should work. The issue seems to be that the output of neither CountVectorizer nor TfidfVectorizer is recognized by the clustering algorithms. If you run classification tasks instead, they work fine.
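For context, the shape of the problem can be illustrated on CPU with scikit-learn alone (a small sketch, no GPU needed): both vectorizers emit a scipy CSR matrix, so any downstream estimator has to support sparse input explicitly.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "clustering sparse text features",
    "classification of sparse text features",
]

X_counts = CountVectorizer().fit_transform(docs)
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Both outputs are scipy.sparse CSR matrices, not dense arrays;
# an estimator that only handles dense input will reject them.
print(issparse(X_counts), issparse(X_tfidf))  # True True
```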
This also happens with cuML's RandomForest. However, models like Naive Bayes and SVC do work in the same setup. Is there a specific reason why RF can't deal with the csr_matrix? Here is a minimal example; I tried both CountVectorizer and HashingVectorizer (cuML and scikit-learn).
```python
import time

import cudf
import cupy as cp
import numpy as np

# from xgboost import XGBClassifier
from cuml.dask.common import to_sparse_dask_array
from cuml.ensemble import RandomForestClassifier as cuRF
# from dask_ml.feature_extraction.text import HashingVectorizer
from cuml.feature_extraction.text import CountVectorizer, HashingVectorizer
# from cuml.dask.naive_bayes import MultinomialNB as cuNB
from cuml.naive_bayes import MultinomialNB as cuNB
from cuml.svm import SVC as cuSVC
from cupyx.scipy.sparse import csr_matrix
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from sklearn.datasets import fetch_20newsgroups

# Create a local CUDA cluster
cluster = LocalCUDACluster()
client = Client(cluster)

# Load corpus
twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
twenty_train = cudf.DataFrame.from_dict(
    {"data": twenty_train.data, "target": twenty_train.target}
)

cv = HashingVectorizer()
xformed = cv.fit_transform(twenty_train.data).astype(np.float32)
X = csr_matrix(xformed).astype(cp.float32)
y = cp.asarray(twenty_train.target).astype(cp.int32)

# Try NB
model = cuNB()
start = time.time()
model.fit(X, y)  # works
end = time.time()
print("Time to train: ", end - start)

# Try RF
model = cuRF()
start = time.time()
model.fit(X, y)  # fails
end = time.time()
print("Time to train: ", end - start)
```
I get the same errors as @erico-imgproj.
Any news on this?
Unfortunately, no news to report.
Thanks for the quick reply!
As a side note, I can run the above code if I force `X` to be a dense array:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def to_dense(X):
    return X.toarray()

# model = the estimator from above
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("to_dense", FunctionTransformer(to_dense)),
    ("clf", model),
])
```

This at least achieves the same predictive accuracy as the scikit-learn version, but is MUCH slower: 3,814 s vs. 125 s, thanks to parallelism on the CPU.
Edit: comma instead of dot to signify 3 thousand seconds instead of 3 seconds.
Well, this approach is valid if you can load the data in memory. In my case, I can't...
I wouldn't call it valid, since it takes longer than on a CPU (see my edit). I also quickly run into OOM with most of the datasets I have; this was just the smallest real-world dataset I had on hand.
If you want to scale ideally on multiple GPUs, I would recommend using the `HashingVectorizer` as a replacement for the `CountVectorizer`. It should yield good results while being stateless / embarrassingly parallel. This produces a sparse array that is split across multiple GPUs, allowing the use of more GPU memory.
```python
from dask.dataframe import from_pandas

# from_pandas requires a partition count; 4 is a placeholder
X = from_pandas(twenty_train, npartitions=4)  # might have to tweak a few things to extract data

def vectorize(df, stop_words, ngram_range):
    hv = HashingVectorizer(stop_words=stop_words, ngram_range=ngram_range)
    return hv.fit_transform(df)

X = X.map_partitions(vectorize, self.stop_words, self.ngram_range)
```
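The stateless property can be checked on CPU with scikit-learn's HashingVectorizer (a small sketch, no GPU or Dask required): transforming partitions independently and stacking gives the same matrix as transforming the whole corpus in one pass.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "hashing needs no fitted vocabulary",
]

hv = HashingVectorizer(n_features=2**10)

whole = hv.transform(docs)                         # one pass over everything
parts = vstack([hv.transform([d]) for d in docs])  # per-partition passes

# Identical results: there is no shared state between partitions.
print((whole != parts).nnz)  # 0
```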
Then there's the issue of some of our estimators not being able to take sparse inputs. There does not seem to be an immediate solution for this. To densify your sparse Dask array, you can use `todense()`; the estimator should then run. If the use of multiple GPUs is not sufficient, you could take a representative sample of your dataset to train the final estimator.
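A representative sample can be drawn without densifying anything, since CSR matrices support row indexing (a CPU sketch using scipy; the matrix sizes and the uniform sampling strategy are placeholders):

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Stand-in for vectorizer output: 10k rows, 1k hashed features.
X = sparse_random(10_000, 1_000, density=0.01, format="csr", random_state=0)
y = rng.integers(0, 2, size=X.shape[0])

# Uniform row sample; the rows stay sparse, so there is no densification cost.
idx = rng.choice(X.shape[0], size=1_000, replace=False)
X_sample, y_sample = X[idx], y[idx]
```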
Thanks, I tried `HashingVectorizer` previously, but `todense` leads to OOM for me. I'm already using subsamples in a 5-fold cross-validation setup, but admittedly I'm only allocating 12 GB of VRAM for my small 14k-row dataset. It also doesn't matter whose `HashingVectorizer` I use; it still leads to OOM.
There is an alternative solution, which is simply to reduce the number of features (`n_features`, default 2**20). I recommend using cuML's `HashingVectorizer` (instead of Dask-ML's) for GPU support, in multi-GPU fashion as demonstrated in my snippet.
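Back-of-the-envelope arithmetic (my own illustration, using the figures mentioned in this thread: 14k rows, 12 GB of VRAM, float32 values) shows why the default `n_features` cannot be densified here, while a reduced value can:

```python
def dense_gib(n_rows, n_features, bytes_per_value=4):
    """Memory needed to hold a dense float32 matrix, in GiB."""
    return n_rows * n_features * bytes_per_value / 2**30

# Default n_features = 2**20: far beyond a 12 GB card.
print(round(dense_gib(14_000, 2**20), 1))  # 54.7

# Reduced n_features = 2**16: comfortably fits.
print(round(dense_gib(14_000, 2**16), 1))  # 3.4
```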
Describe the bug NLP clustering does not work properly. The code available in the example works fine for classification tasks, but the clustering algorithms do not accept the output from classes like CountVectorizer or TfidfVectorizer.
This error also happens when executing PCA on the results of CountVectorizer or TfidfVectorizer.
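As a CPU-side aside (my own suggestion, not from this thread): scikit-learn's PCA likewise rejects sparse input because it mean-centers the data, and the usual workaround there is TruncatedSVD, which factors a CSR matrix directly without densifying; a GPU equivalent would presumably be used the same way.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sparse matrices save memory",
    "pca needs dense centered input",
    "truncated svd accepts sparse input",
]

X = CountVectorizer().fit_transform(docs)  # scipy CSR matrix

# TruncatedSVD skips mean-centering, so it can work on X as-is.
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```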
Steps/Code to reproduce bug
The code above works fine, but when the clustering algorithm is called it returns the following error:
In the case of PCA, the code added before the clustering task is the following:
and the error generated is:
Expected behavior The clustering and PCA algorithms should return the clusters in a list, plus another tabular data structure for post-processing.
Environment details (please complete the following information):
Additional context This error is related to the first mention in #5805.