rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.25k stars 534 forks

[QST] generate big data (~300GB) with cuml.dask.datasets.make_blobs #4494

Open NightMachinery opened 2 years ago

NightMachinery commented 2 years ago

I want to test clustering about 300GB of data with cuML. The data itself is not ready yet, so I need some synthetic data to see whether the dimensionality-reduction and clustering algorithms can handle that volume.

The way I have thought about doing this is:

Does this make sense?

Then, as far as I understand, my only options for reducing the dimensionality of the data (to make it small enough to fit in memory) are random projections and incremental PCA. Should I call partial_fit on each data partition? Or should I draw small random minibatches (~1024 samples) from each partition and call partial_fit on those? Or can I simply call fit consecutively on each partition?
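For the random-projection route, the chunk-wise pattern can be sketched with plain NumPy. The sizes here are toy placeholders, and the Gaussian projection matrix `R` is built by hand for illustration, not via cuML's or dask-ml's API:

```python
import numpy as np

# Toy sizes, far smaller than the real ~300 GB case.
n_samples, n_features, n_components = 2_000, 512, 64
chunk_rows = 500

rng = np.random.default_rng(0)
X = rng.standard_normal((n_samples, n_features)).astype(np.float32)

# One fixed Gaussian projection matrix, reused for every chunk so the
# projection is consistent across partitions.
R = rng.standard_normal((n_features, n_components)).astype(np.float32)
R /= np.sqrt(n_components)  # scale to roughly preserve pairwise distances

# Project the data one chunk at a time; only one chunk plus R
# has to fit in memory at once.
reduced = np.vstack([
    X[i:i + chunk_rows] @ R
    for i in range(0, n_samples, chunk_rows)
])
print(reduced.shape)  # (2000, 64)
```

Because `R` is fixed up front, there is no fit/partial_fit question at all for random projections: every partition can be projected independently.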

Is there any clustering algorithm that can train out-of-core? If not, does that mean that my only bet is dimension reduction?
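As a concept sketch of what out-of-core clustering looks like, here is a toy minibatch k-means in plain NumPy that only ever holds one batch in memory. This illustrates the idea only; it is not cuML's or dask-ml's implementation, and `minibatch_kmeans` / `init_centers` are names invented for this sketch:

```python
import numpy as np

def minibatch_kmeans(batches, init_centers):
    """Toy out-of-core k-means: the data is streamed batch by batch,
    and centers are updated with a per-center learning rate of
    (batch count) / (cumulative count), so each center tracks the
    running mean of the points assigned to it."""
    centers = np.array(init_centers, dtype=np.float64)
    counts = np.zeros(len(centers), dtype=np.int64)
    for batch in batches:
        # Assign each point in the batch to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in np.unique(labels):
            pts = batch[labels == j]
            counts[j] += len(pts)
            eta = len(pts) / counts[j]  # per-center learning rate
            centers[j] = (1 - eta) * centers[j] + eta * pts.mean(axis=0)
    return centers

# Two well-separated blobs, streamed in batches of 100 points.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 0.5, (1000, 2)),
                       rng.normal(+5, 0.5, (1000, 2))])
rng.shuffle(data)
batches = (data[i:i + 100] for i in range(0, len(data), 100))
centers = minibatch_kmeans(batches, init_centers=[[-1.0, -1.0], [1.0, 1.0]])
```

The same streaming structure is what a partition-by-partition partial_fit loop would look like, with each Dask partition playing the role of a batch.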

Related:

NightMachinery commented 2 years ago

I figured out the generation part:

import numpy as np

from dask_ml.datasets import make_blobs as dask_make_blobs

# 10**6 samples x 10**4 features, built lazily in 10**4 x 10**4 chunks
X, y = dask_make_blobs(
    n_samples=10**6,
    n_features=10**4,
    centers=10,
    chunks=(10**4, 10**4))
X = X.astype(np.float32)
y = y.astype(np.float32)
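As a sanity check on the generated size (back-of-the-envelope arithmetic, not measured): at float32, 4 bytes per element, the array above comes to

```python
n_samples, n_features = 10**6, 10**4
bytes_f32 = n_samples * n_features * 4  # float32 = 4 bytes per element
print(bytes_f32 / 2**30)                # ~37.25 GiB
```

so roughly 8x as many samples (or float64 with 4x the samples) would be needed to actually reach ~300 GB.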

Is such a Dask array usable with cuML? Can it work on the GPU? (Assuming X is larger than both the main memory and the GPU memory.)

I tried the following, but it does not seem to work:

import cupy

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Start the CUDA workers first, so the graph executes on GPU workers.
cluster = LocalCUDACluster()
client = Client(cluster)
##
# Lazily convert each NumPy chunk to a CuPy array on the workers.
X = X.map_blocks(cupy.asarray)
##
from cuml.dask.cluster import KMeans as cuKMeans

km_cu = cuKMeans(n_clusters=10)
km_cu.fit(X)

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.