rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Multi-GPU DBSCAN is broken #6110

Open vikcost opened 1 month ago

vikcost commented 1 month ago

Below is a minimal test script for multi-GPU DBSCAN. I have 6 RTX 4090 GPUs on my machine that I want to utilize.

I observe memory allocations and de-allocations on my GPUs, but DBSCAN fails to return any result.

Any idea where the issue might be coming from?

import numpy as np
from cuml.dask.cluster import DBSCAN
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # One Dask worker per visible GPU
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # 100k random 256-dimensional embeddings
    embs = np.random.randn(100_000, 256)

    dbscan = DBSCAN(
        client=client,
        eps=0.25,
        min_samples=5,
        metric="cosine",
    ).fit(embs)

Environment details:

divyegala commented 1 month ago

@vikcost can you explain what you mean by DBSCAN failing to return any result? Does that mean there is a crash or something else going on?

vikcost commented 2 weeks ago

@divyegala

After waiting longer, I do see DBSCAN return clustering results on a dataset of 1_000_000 data points. However, I expected quicker performance from setting rmm_pool_size="24GB"; instead, computation time increased slightly, from 303 sec to 312 sec.

cluster = LocalCUDACluster(protocol="ucx", rmm_pool_size="24GB")

That's unexpected, given that RMM is designed for efficient memory management. Am I setting these parameters in the wrong way?
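For context, here is a sketch of how I understand the pool setting (my interpretation, not from the cuML docs; the exact size to use depends on your GPUs): rmm_pool_size is applied per worker, i.e. per GPU, and a pool only pre-reserves device memory so that later allocations skip cudaMalloc. It does not change the DBSCAN algorithm itself, so no large speedup is guaranteed if allocation overhead was never the bottleneck.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Sketch: pool is per worker (per GPU), so "22GB" here means 22 GB on each
# of the 6 GPUs. Leaving some headroom below the full 24 GB of a 4090
# avoids the pool reservation itself failing or starving other allocations.
cluster = LocalCUDACluster(
    protocol="ucx",         # UCX transport for inter-GPU communication
    rmm_pool_size="22GB",   # pre-reserved RMM pool per worker (assumption: headroom helps)
)
client = Client(cluster)
```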

vikcost commented 2 weeks ago

However, when I run clustering on 5_000_000 data points, I no longer see the typical log output shown below:

[W] [22:35:28.663183] Batch size limited by the chosen integer type (4 bytes). 3998 -> 2147. Using the larger integer type might result in better performance
[W] [22:35:32.380082] Batch size limited by the chosen integer type (4 bytes). 3998 -> 2147. Using the larger integer type might result in better performance
...
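If I read the warning correctly (my interpretation, not from the cuML docs), the batch row count is capped so that batch_rows * n_points, the element count of one pairwise-distance batch, fits in a signed 4-byte integer. With the 1_000_000-point run, that cap works out to exactly the 2147 in the log:

```python
INT32_MAX = 2**31 - 1   # largest value representable by a signed 4-byte integer
n_points = 1_000_000    # dataset size from the earlier run

# Maximum rows per pairwise-distance batch so the element count fits in int32
max_batch_rows = INT32_MAX // n_points
print(max_batch_rows)   # → 2147, matching "3998 -> 2147" in the warning
```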

Also, GPU utilization is 0% and the script doesn't show any signs of activity. How would one estimate the run time of multi-GPU DBSCAN as a function of the number of data points?
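As a rough estimate only (assuming the run time is dominated by the O(n^2) pairwise-distance computation, which the batching warnings suggest), time should grow roughly quadratically in the number of points:

```python
t_1m = 303.0                     # seconds observed for 1_000_000 points
n1, n2 = 1_000_000, 5_000_000

# Quadratic scaling: 5x the points -> ~25x the pairwise-distance work
t_5m = t_1m * (n2 / n1) ** 2
print(f"{t_5m:.0f} sec (~{t_5m / 3600:.1f} h)")   # → 7575 sec (~2.1 h)
```

So under this (admittedly crude) model the 5_000_000-point run would need on the order of two hours even if healthy; a prolonged 0% GPU utilization before that, though, points at a host-side stage (e.g. data distribution) rather than GPU compute.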