rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[NO MERGE] Putting data on host for HDBSCAN using mst optimize #6044

Open jinsolp opened 3 months ago

jinsolp commented 3 months ago

For future reference: this PR puts the data on host for HDBSCAN so that it scales to large datasets. No reviews needed.

reachability.cuh currently uses an optimize() function from cuvs to ensure connectedness of the knn graph. Note that cuML does not support building with cuvs yet, so the related functions are copy-pasted into the mst_opt.cuh file.
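For readers unfamiliar with why connectedness matters here: a knn graph can split into several components, and a spanning tree cannot be built across them. The snippet below is only a rough pure-Python sketch of that idea, not the cuvs optimize() routine or the code in mst_opt.cuh. All function names (`knn_edges`, `components`, `connect`) are made up for illustration, and the brute-force "add the shortest cross-component edge" step is an assumed stand-in for whatever the real implementation does.

```python
# Illustrative only: pure-Python stand-in, NOT the cuvs optimize() routine.
from math import dist


def find(parent, i):
    # Find the component root of i, halving the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def knn_edges(points, k):
    # Brute-force kNN graph as a set of undirected (i, j) edges with i < j.
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(p, points[j]))[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges


def components(n, edges):
    # Union-find over the edges; returns one root label per vertex.
    parent = list(range(n))
    for i, j in edges:
        parent[find(parent, i)] = find(parent, j)
    return [find(parent, i) for i in range(n)]


def connect(points, edges):
    # Hypothetical fix-up: keep adding the shortest edge that joins two
    # different components until the graph has a single component.
    edges = set(edges)
    n = len(points)
    while True:
        labels = components(n, edges)
        if len(set(labels)) == 1:
            return edges
        best = min(((i, j) for i in range(n) for j in range(i + 1, n)
                    if labels[i] != labels[j]),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
        edges.add(best)


# Two well-separated groups: a 1-NN graph cannot link them on its own.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
graph = knn_edges(points, k=1)
fixed = connect(points, graph)
```

In this toy input the 1-NN graph has two components, and `connect` adds a single bridging edge so that a spanning tree over all four points exists.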

NND batching and putting data on host are both supported by the implementation. It can be run like this:

from cuml.cluster import HDBSCAN

hdbscan_nnd = HDBSCAN(min_samples=16, build_algo="nn_descent", build_kwds={"nnd_return_distances": True, "nnd_n_clusters": 4})
labels = hdbscan_nnd.fit(data, data_on_host=True).labels_