rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

Clustering problems with NLP and CUML #5805

Open erico-imgproj opened 8 months ago

erico-imgproj commented 8 months ago

What is your question? While processing a large NLP dataset, I found a very good example on the cuML documentation site. Following its instructions, I wrote my own version for my dataset, which contains 6 million phrases. I want to run a clustering algorithm on it to begin testing.

import dask
dask.config.set(**{'array.slicing.split_large_chunks': True})
import cupy as cp
import cudf
from cuml.dask.common import to_sparse_dask_array
from cuml.feature_extraction.text import CountVectorizer
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
n_workers = 4
cluster.scale(n_workers)

workers = client.has_what().keys()

fulldf = cudf.read_parquet('phrases.parquet')
fulldf = fulldf[~fulldf.origphrase.isna()]
fulldf.origphrase = fulldf.origphrase.astype(str)

cv = CountVectorizer()
X_tfidf = cv.fit_transform(fulldf['origphrase']).astype(cp.float32)
X = to_sparse_dask_array(X_tfidf, client)

from cuml.dask.cluster import KMeans

kmeans_float = KMeans(n_clusters=51)
yhat = kmeans_float.fit_predict(X)  # Line where the error happens

After preprocessing the data, the X variable is of type

dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>

which is the same type that the RAPIDS example shows. Unfortunately, I run into a problem when I pass it to the KMeans model:
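One detail in the array repr above is that `chunksize` equals the full `shape`, so the whole matrix sits in a single chunk. A minimal sketch of that arithmetic in plain Python follows; the helper name `count_chunks` is hypothetical and is not part of Dask or cuML.

```python
import math

def count_chunks(shape, chunksize):
    """Number of chunks in a regularly chunked array (hypothetical helper)."""
    return math.prod(math.ceil(dim / chunk) for dim, chunk in zip(shape, chunksize))

# Values copied from the array repr in this issue.
shape = (6261516, 232309)
chunksize = (6261516, 232309)

n_chunks = count_chunks(shape, chunksize)
print(n_chunks)

# For comparison: splitting the rows evenly four ways would give four chunks.
n_chunks_split = count_chunks(shape, (1565379, 232309))
print(n_chunks_split)
```

With a single chunk, only one worker can hold the data. That may be why the KMeans log below reports "1 of 1 worker jobs failed" even though four workers were requested, but this is a guess rather than a confirmed diagnosis.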

2024-03-15 12:49:50,008 - distributed.worker - WARNING - Compute Failed
Key:       _func_fit-3f8c5c07-bc70-4821-9613-bc8545faf086
Function:  _func_fit
args:      (b'\x99\xab\xfbBR\xf5M\xd1\x91v\x1f9\x0et\xaa\xd3', [<cupyx.scipy.sparse._csr.csr_matrix object at 0x7f134f8825f0>], 'cupy', False)
kwargs:    {'n_clusters': 51, 'verbose': False}
Exception: 'AttributeError("\'NoneType\' object has no attribute \'shape\'")'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 198, in fit_predict
    return self.fit(X, sample_weight=sample_weight).predict(
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 175, in fit
    wait_and_raise_from_futures(kmeans_fit)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 164, in wait_and_raise_from_futures
    raise_exception_from_futures(futures)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 152, in raise_exception_from_futures
    raise RuntimeError(
RuntimeError: 1 of 1 worker jobs failed: 'NoneType' object has no attribute 'shape'

If I try to run yhat = kmeans_float.fit_predict(X.compute()) instead, the error changes to:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 198, in fit_predict
    return self.fit(X, sample_weight=sample_weight).predict(
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/kmeans.py", line 154, in fit
    data = DistributedDataHandler.create(inputs, client=self.client)
  File "/home/erico/lab/packages_dask/cuml/dask/common/input_utils.py", line 108, in create
    datatype, multiple = _get_datatype_from_inputs(data)
  File "/home/erico/lab/packages_dask/cuml/dask/common/input_utils.py", line 193, in _get_datatype_from_inputs
    validate_dask_array(data)
  File "/home/erico/lab/packages_dask/cuml/dask/common/dask_arr_utils.py", line 34, in validate_dask_array
    if len(darray.chunks) > 2:
AttributeError: 'csr_matrix' object has no attribute 'chunks'
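This second traceback is consistent with `X.compute()` returning a concrete `csr_matrix` rather than a Dask array: `validate_dask_array` reads `darray.chunks`, and the materialized matrix has no such attribute. A minimal sketch of that distinction, using stand-in classes so it runs without a GPU; both classes and `looks_like_dask_array` are hypothetical illustrations, not cuML or Dask code.

```python
class FakeDaskArray:
    """Stand-in for a Dask array: exposes a `chunks` attribute."""

    def __init__(self, chunks):
        self.chunks = chunks


class FakeCsrMatrix:
    """Stand-in for a concrete sparse matrix: has a shape but no `chunks`."""

    def __init__(self, shape):
        self.shape = shape


def looks_like_dask_array(obj):
    """Duck-typed check for the attribute the traceback complains about."""
    return hasattr(obj, "chunks")


lazy = FakeDaskArray(chunks=((6261516,), (232309,)))
concrete = FakeCsrMatrix(shape=(6261516, 232309))

print(looks_like_dask_array(lazy))
print(looks_like_dask_array(concrete))
```

A check like this before calling a `cuml.dask` estimator would show whether the input is still a lazy Dask collection or has already been materialized.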

Changing the clustering algorithm does not help either. For instance, I tried the following code:

from cuml.dask.cluster import DBSCAN
model = DBSCAN(min_samples=5, gen_min_span_tree=True)
yhat = model.fit_predict(X)

And I get this error:

Key:       _func-1d45cd66-2d8a-4525-9c2e-f576d861a7c5
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

2024-03-15 12:54:36,754 - distributed.worker - WARNING - Compute Failed
Key:       _func-092bbae9-6a71-4b2e-b670-445edd0005e9
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

2024-03-15 12:54:36,757 - distributed.worker - WARNING - Compute Failed
Key:       _func-9cc5519b-dda6-4208-8e76-a1a6f345c4ad
Function:  _func
args:      (b'\\\x97\xd5\x9fJ\xd3@\xe3\x92\xc7Y\x00\xad\x8a\xb7\x0b', dask.array<from-value, shape=(6261516, 232309), dtype=float64, chunksize=(6261516, 232309), chunktype=cupyx.csr_matrix>)
kwargs:    {'min_samples': 5, 'gen_min_span_tree': True, 'verbose': False}
Exception: "ValueError('setting an array element with a sequence.')"

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/dbscan.py", line 160, in fit_predict
    self.fit(X, out_dtype)
  File "/home/erico/lab/packages_dask/cuml/internals/memory_utils.py", line 87, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/erico/lab/packages_dask/cuml/dask/cluster/dbscan.py", line 133, in fit
    wait_and_raise_from_futures(dbscan_fit)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 164, in wait_and_raise_from_futures
    raise_exception_from_futures(futures)
  File "/home/erico/lab/packages_dask/cuml/dask/common/utils.py", line 152, in raise_exception_from_futures
    raise RuntimeError(
RuntimeError: 4 of 4 worker jobs failed: setting an array element with a sequence., setting an array element with a sequence., setting an array element with a sequence., setting an array element with a sequence.

Any help is appreciated

dantegd commented 8 months ago

Thanks for the issue! I'm not entirely sure what's happening. Any chance you could run the script https://github.com/rapidsai/cuml/blob/branch-24.04/print_env.sh and post the output? Seeing which versions of cuml, dask, etc. you have would be very useful for reproducing this.
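If running the shell script is inconvenient, a rough alternative is to collect package versions from Python with the standard library's `importlib.metadata`. This is only a sketch and makes no claim to match the output of `print_env.sh`. The distribution names are assumptions copied from the pip-style listing elsewhere in this thread and will differ for conda or CUDA 12 installs; `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Assumed distribution names (pip, CUDA 11 builds); adjust to your install.
PACKAGES = [
    "cuml-cu11",
    "cudf-cu11",
    "cupy-cuda11x",
    "dask",
    "dask-cuda",
    "raft-dask-cu11",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or a marker if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name:<24}{version}")
```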

erico-imgproj commented 8 months ago

Hello,

Here is my configuration:

cubinlinker-cu11          0.3.0.post1
cucim-cu11                24.2.0
cuda-python               11.8.3
cudf-cu11                 24.2.2
cugraph-cu11              24.2.0
cuml-cu11                 24.2.0
cuproj-cu11               24.2.0
cupy-cuda11x              13.0.0
cuspatial-cu11            24.2.0
cuxfilter-cu11            24.2.0
dask                      2024.1.1
dask-cuda                 24.2.0
dask-cudf-cu11            24.2.2
dask-glm                  0.3.2
dask-ml                   2023.3.24
raft-dask-cu11            24.2.0
rapids-dask-dependency    24.2.0

I hope it helps