rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] AgglomerativeClustering: Support cut distance_threshold parameter #4056

Open cjnolet opened 3 years ago

cjnolet commented 3 years ago

Since the dendrogram is a binary tree, the current implementation of AgglomerativeClustering cuts the dendrogram at a particular level based on the user-provided parameter n_clusters. This works well when the user knows the number of clusters, but it is awkward when the user instead knows a distance threshold and not the resulting number of clusters.

Supporting the distance_threshold parameter shouldn't be too hard. Rather than slicing the dendrogram at a particular level, we would merge clusters whose linkage distance falls below the threshold, yielding a final set of flattened clusters.
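The threshold cut described above can be sketched in plain Python. This is only an illustration, not cuML's or RAFT's implementation: the function name cut_by_distance is hypothetical, the merge list is assumed to follow the scikit-learn children_ convention (ids below n_samples are leaves, and id n_samples + i names the cluster created by merge i), and the merge distances are assumed to be non-decreasing, as they are for single linkage.

```python
def cut_by_distance(n_samples, children, distances, distance_threshold):
    """Flatten a dendrogram by applying only merges below the threshold.

    Assumptions (hypothetical helper, scikit-learn style inputs):
      - children[i] = (a, b) is the i-th merge; ids < n_samples are leaves,
        and id n_samples + i is the cluster created by merge i.
      - distances[i] is the linkage distance of merge i, non-decreasing in i.
    """
    # One slot per leaf and per internal node; each node starts as its own root.
    parent = list(range(n_samples + len(children)))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, ((a, b), d) in enumerate(zip(children, distances)):
        # Merges at or above the threshold are not applied.
        if d < distance_threshold:
            node = n_samples + i
            parent[find(a)] = node
            parent[find(b)] = node

    # Relabel the surviving roots as consecutive integers 0, 1, 2, ...
    root_to_label = {}
    labels = []
    for sample in range(n_samples):
        root = find(sample)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Four points on a line at 0, 1, 5, 6 under single linkage:
# merge (0, 1) at distance 1, merge (2, 3) at distance 1,
# then merge the two resulting clusters (ids 4 and 5) at distance 4.
example_children = [(0, 1), (2, 3), (4, 5)]
example_distances = [1.0, 1.0, 4.0]
example_labels = cut_by_distance(4, example_children, example_distances, 2.0)
print(example_labels)  # [0, 0, 1, 1]
```

Note that the number of clusters falls out of the threshold rather than being specified up front, which is the behavior requested here.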

frankxu2004 commented 3 years ago

I would really love this feature. As a first step, I am wondering whether we could allow the Python API to return the dendrogram tree data structure (including the linkage values). I dug into the code a bit, and I imagine that https://github.com/rapidsai/raft/blob/75656cee48b544caf609555f838eac39e68e3438/cpp/include/raft/sparse/hierarchy/detail/agglomerative.cuh#L120 holds the distance values that could be used as the cutoff?

I also see that children (https://github.com/rapidsai/raft/blob/75656cee48b544caf609555f838eac39e68e3438/cpp/include/raft/sparse/hierarchy/detail/agglomerative.cuh#L142) gets returned to the Python API. However, it would be great to know what the elements of the (2, num_rows) children array represent: https://github.com/rapidsai/cuml/blob/dd7cbf45c1d089ece7db0f1610d3cab775f3de02/python/cuml/cluster/agglomerative.pyx#L200 To me, they do not seem to be row indices of the dataset; I imagine they encode a binary tree structure.
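For comparison, here is how such an array would decode if it followed the scikit-learn children_ convention. That is an assumption and has not been checked against cuML's output: in that convention the array has one row per merge, so a (2, num_rows) layout might first need transposing. The helper name leaves_under_each_merge is hypothetical.

```python
def leaves_under_each_merge(n_samples, children):
    """List the original sample indices contained in each merged cluster.

    Assumes the scikit-learn children_ convention (not verified for cuML):
    an id < n_samples is a dataset row (leaf), and an id >= n_samples refers
    to the cluster created by merge number (id - n_samples).
    """
    # Each leaf starts out as a cluster holding only itself.
    members = {i: [i] for i in range(n_samples)}
    result = []
    for step, (a, b) in enumerate(children):
        merged = members[a] + members[b]
        # The cluster created at this step gets the next unused id.
        members[n_samples + step] = merged
        result.append(sorted(merged))
    return result


# Ids 4 and 5 are not dataset rows here: they name the clusters created
# by merge 0 and merge 1 respectively.
decoded = leaves_under_each_merge(4, [(0, 1), (2, 3), (4, 5)])
print(decoded)  # [[0, 1], [2, 3], [0, 1, 2, 3]]
```

Under this reading, values at or above the number of rows are internal tree nodes rather than row indices, which would match the impression that the array encodes a binary tree.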

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Z-Qing commented 1 year ago

Sorry to ask, but has anyone solved this problem yet? I am having the same problem.

OasisArtisan commented 1 month ago

This would be great to have.