rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.25k stars 534 forks

[QST] generate big data (~300GB) with cuml.dask.datasets.make_blobs #4494

Open NightMachinery opened 2 years ago

NightMachinery commented 2 years ago

I want to test clustering about 300GB of data with cuML. The data itself is not ready yet, so I need some synthetic data to see whether the dimensionality-reduction and clustering algorithms can handle that volume.

The way I have thought about doing this is:

Does this make sense?

Then, as far as I understand, my only options for reducing the dimensionality of the data (to make it small enough to fit in memory) are random projections and incremental PCA. Should I call partial_fit on each data partition? Or should I draw small random minibatches (~1024 samples) from each partition and call partial_fit on those? Or can I simply call fit consecutively on each partition?
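For the random-projection route, the chunk-wise pattern can be sketched with plain NumPy. The sizes here are toy placeholders, and the Gaussian projection matrix `R` is built by hand for illustration, not via cuML's or dask-ml's API:

```python
import numpy as np

# Toy sizes, far smaller than the real ~300 GB case.
n_samples, n_features, n_components = 2_000, 512, 64
chunk_rows = 500

rng = np.random.default_rng(0)
X = rng.standard_normal((n_samples, n_features)).astype(np.float32)

# One fixed Gaussian projection matrix, reused for every chunk so the
# projection is consistent across partitions.
R = rng.standard_normal((n_features, n_components)).astype(np.float32)
R /= np.sqrt(n_components)  # scale to roughly preserve pairwise distances

# Project the data one chunk at a time; only one chunk plus R
# has to fit in memory at once.
reduced = np.vstack([
    X[i:i + chunk_rows] @ R
    for i in range(0, n_samples, chunk_rows)
])
print(reduced.shape)  # (2000, 64)
```

Because `R` is fixed up front, there is no fit/partial_fit question at all for random projections: every partition can be projected independently.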

Is there any clustering algorithm that can train out-of-core? If not, does that mean that my only bet is dimension reduction?
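As a concept sketch of what out-of-core clustering looks like, here is a toy minibatch k-means in plain NumPy that only ever holds one batch in memory. This illustrates the idea only; it is not cuML's or dask-ml's implementation, and `minibatch_kmeans` / `init_centers` are names invented for this sketch:

```python
import numpy as np

def minibatch_kmeans(batches, init_centers):
    """Toy out-of-core k-means: the data is streamed batch by batch,
    and centers are updated with a per-center learning rate of
    (batch count) / (cumulative count), so each center tracks the
    running mean of the points assigned to it."""
    centers = np.array(init_centers, dtype=np.float64)
    counts = np.zeros(len(centers), dtype=np.int64)
    for batch in batches:
        # Assign each point in the batch to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in np.unique(labels):
            pts = batch[labels == j]
            counts[j] += len(pts)
            eta = len(pts) / counts[j]  # per-center learning rate
            centers[j] = (1 - eta) * centers[j] + eta * pts.mean(axis=0)
    return centers

# Two well-separated blobs, streamed in batches of 100 points.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 0.5, (1000, 2)),
                       rng.normal(+5, 0.5, (1000, 2))])
rng.shuffle(data)
batches = (data[i:i + 100] for i in range(0, len(data), 100))
centers = minibatch_kmeans(batches, init_centers=[[-1.0, -1.0], [1.0, 1.0]])
```

The same streaming structure is what a partition-by-partition partial_fit loop would look like, with each Dask partition playing the role of a batch.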

Related:

NightMachinery commented 2 years ago

I figured out the generation part:

import numpy as np

from dask_ml.datasets import make_blobs as dask_make_blobs

# 10**6 samples x 10**4 features, built lazily in 10**4 x 10**4 chunks
X, y = dask_make_blobs(
    n_samples=10**6,
    n_features=10**4,
    centers=10,
    chunks=(10**4, 10**4))
X = X.astype(np.float32)
y = y.astype(np.float32)
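As a sanity check on the generated size (back-of-the-envelope arithmetic, not measured): at float32, 4 bytes per element, the array above comes to

```python
n_samples, n_features = 10**6, 10**4
bytes_f32 = n_samples * n_features * 4  # float32 = 4 bytes per element
print(bytes_f32 / 2**30)                # ~37.25 GiB
```

so roughly 8x as many samples (or float64 with 4x the samples) would be needed to actually reach ~300 GB.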

Is such a Dask array usable with cuML? Can it work on the GPU? (Assuming X is larger than both the main memory and the GPU memory.)

I tried the following, but it does not seem to work:

import cupy

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Start the CUDA workers first, so the graph executes on GPU workers.
cluster = LocalCUDACluster()
client = Client(cluster)
##
# Lazily convert each NumPy chunk to a CuPy array on the workers.
X = X.map_blocks(cupy.asarray)
##
from cuml.dask.cluster import KMeans as cuKMeans

km_cu = cuKMeans(n_clusters=10)
km_cu.fit(X)

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.