rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Dask + UMAP does not work with numpy array. #5893

Open nahaharo opened 3 months ago

nahaharo commented 3 months ago

Describe the bug When using Dask + UMAP to run on multiple GPUs, if the input array is a np.ndarray rather than a CuPy array, Dask raises an error:

ValueError: could not broadcast input array from shape (7,1) into shape (7,)
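For reference, the same ValueError can be reproduced with plain NumPy. The shapes in the traceback suggest a (7, 1) column-vector chunk result being assigned into a flat (7,)-shaped buffer; this is a sketch of the shape mismatch only, not cuML's actual internals:

```python
import numpy as np

# A flat output buffer, matching the (7,)-shaped target from the traceback.
out = np.empty((7,))

try:
    # Assigning a (7, 1) column vector raises the same broadcast error,
    # because (7, 1) does not broadcast to (7,).
    out[:] = np.ones((7, 1))
except ValueError as e:
    print(e)

# Flattening the column vector first makes the assignment succeed.
out[:] = np.ones((7, 1)).ravel()
print(out.shape)
```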

If I cast the input array to a CuPy array, it runs without error.

Below is the code to reproduce:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da
from cuml.manifold import UMAP
from cuml.dask.manifold import UMAP as MNMG_UMAP
import numpy as np
import cupy

if __name__ == "__main__":
    cluster = LocalCUDACluster(n_workers=2)
    client = Client(cluster)
    X = np.zeros((100, 10, 49), dtype=np.float32).reshape(100, -1)
    # X = cupy.asarray(X)
    print(X.shape, type(X))
    local_model = UMAP(random_state=10, n_components=1)
    val = local_model.fit_transform(X)

    distributed_model = MNMG_UMAP(model=local_model)
    distributed_X = da.from_array(X, chunks=(7, -1))
    embedding = distributed_model.transform(distributed_X)
    result = embedding.compute()
    client.close()
    cluster.close()

If I uncomment the line X = cupy.asarray(X), it runs without error.

dantegd commented 3 months ago

Thanks for the issue @nahaharo, this is useful feedback. It's in the backlog and we will add it in the future, but there's no ETA currently.