NightMachinery opened 2 years ago
I figured out the generation part:
```python
import numpy as np
from dask_ml.datasets import make_blobs as dask_make_blobs

# 10**6 samples x 10**4 features, 10 centers, split into 10**4 x 10**4 chunks
# (~40 GB as float32)
X, y = dask_make_blobs(
    n_samples=10**6,
    n_features=10**4,
    centers=10,
    chunks=(10**4, 10**4))
X = X.astype(np.float32)
y = y.astype(np.float32)
```
Is such a Dask array usable with cuML? Can it work on the GPU? (Assuming both the main memory and the GPU memory are smaller than `X`.)
I tried the following, but it does not seem to work:
```python
import cupy

# Convert each chunk to a CuPy array (the conversion runs lazily on compute)
X = X.map_blocks(cupy.asarray)
##
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One Dask worker per local GPU
cluster = LocalCUDACluster()
client = Client(cluster)
##
from cuml.dask.cluster import KMeans as cuKMeans

km_cu = cuKMeans(n_clusters=10)
km_cu.fit(X)
```
I want to test clustering about 300GB of data with cuML. The data itself is not ready yet, so I need synthetic data to check whether the dimensionality-reduction and clustering algorithms can handle that volume.
The way I have thought about doing this is: generate the cluster centers once, then call `make_blobs` with those same centers but different random seeds to produce the different 'partitions' of the data. Does this make sense?
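Concretely, something like the sketch below is what I have in mind (plain `sklearn.datasets.make_blobs`; the center scale, partition size, number of partitions, and file names are placeholders I have not settled on):

```python
import numpy as np
from sklearn.datasets import make_blobs

n_features = 10**4
n_centers = 10
n_partitions = 100        # placeholder: number of on-disk partitions
rng = np.random.default_rng(0)

# Generate the cluster centers once, so every partition is drawn from the
# same underlying mixture.
centers = rng.uniform(-10.0, 10.0, size=(n_centers, n_features)).astype(np.float32)

def make_partition(seed, n_samples=10**4):
    # One "partition" of the full dataset; only the sampling seed differs.
    X_part, y_part = make_blobs(
        n_samples=n_samples,
        centers=centers,
        random_state=seed,
    )
    return X_part.astype(np.float32), y_part

# Write each partition to disk so it can be streamed later.
for seed in range(n_partitions):
    X_part, y_part = make_partition(seed)
    np.save(f"blobs_X_{seed}.npy", X_part)
    np.save(f"blobs_y_{seed}.npy", y_part)
```

Each such partition is about 10**4 × 10**4 × 4 bytes ≈ 400MB in float32, so the number of partitions controls the total dataset size.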
Then, as far as I understand, my only options for reducing the dimensionality of the data (to make it small enough that it might fit in memory) are random projections and incremental PCA. Should I just call `partial_fit` on each data partition? Or should I select small random minibatches (~1024 samples) from each partition and call `partial_fit` on those? Or can I perhaps just call `fit` consecutively on each partition?
Is there any clustering algorithm that can train out-of-core? If not, does that mean that my only bet is dimensionality reduction?
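For reference, the `partial_fit`-per-partition variant I am describing would look roughly like this (sketched with scikit-learn's `IncrementalPCA` and the placeholder file names from the generation sketch above; `n_components` is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_partitions = 100        # placeholder, matches the generation sketch
n_components = 128        # placeholder target dimensionality

ipca = IncrementalPCA(n_components=n_components)

# First pass: fit the projection incrementally, one partition at a time,
# so only a single partition has to be in memory at once.
for seed in range(n_partitions):
    X_part = np.load(f"blobs_X_{seed}.npy")
    ipca.partial_fit(X_part)

# Second pass: project each partition down to n_components dimensions.
# The reduced dataset is much smaller and might then fit in (GPU) memory
# for clustering.
X_reduced = np.concatenate(
    [ipca.transform(np.load(f"blobs_X_{seed}.npy")) for seed in range(n_partitions)]
)
```

The minibatch variant would be the same loop, but feeding ~1024-sample slices of each partition to `partial_fit` instead of the whole partition.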
Related: