[TASK] Post HDBSCAN merge tasks

divyegala commented 3 years ago

These are tasks for cuML's HDBSCAN implementation after the 21.06 release.

[x] Move HDBSCAN out of experimental (this can be done after sections 1, 2, and 3 below are complete).

1. Necessary tech debt / cleanup (e.g. need to have)

[x] Reduce number of clusters in output struct after cluster selection occurs (https://github.com/rapidsai/cuml/pull/3987)
[x] Output stability scores (e.g. persistent clusters Python attribute) (https://github.com/rapidsai/cuml/pull/3987)
[x] Use lazy-loaded / @property for plotting tools in estimator. (https://github.com/rapidsai/cuml/pull/3986)

2. Testing / Correctness verification

[x] Verify cluster_selection_method=eom for empty cluster tree and allow_single_cluster=True
[x] Add cluster condensing gtests with contrived examples (https://github.com/rapidsai/cuml/pull/4004)
[x] #4042
[x] Allow reference HDBSCAN's CondensedTree object to be accepted to cuML and end-to-end run extract_clusters from that object. This is to help with robust testing of HDBSCAN by helping ignore divergences from KNN and MST solutions (#4009)
[x] Finish complete testing of cluster_selection_method=leaf and cluster_selection_epsilon != 0.0 with the help of above (#4009)

3. Test failures / bugs (e.g. must have)

[x] Cluster sizes for some tests seem to differ only on A100 (see related issue) (#4024)
[x] Intermittent crash that has appeared a couple of times (see related issue) (#4025)
[x] https://github.com/rapidsai/cuml/issues/4054

4. Additional tech debt / cleanup (e.g. nice to have)

[ ] Some arrays are being used as int instead of bool due to interop issues between host and device bool. Update these to bool.
[ ] Investigate potential parallelization of do_labelling()

5. External

[ ] Submit patch to reference HDBSCAN for https://github.com/scikit-learn-contrib/hdbscan/issues/476 and update cuML when patch is accepted

6. Additional features before blog

[ ] #4815
[ ] #4814
[x] "Official" benchmarking for blog

7. Additional features (e.g. like to have)

[x] Add outlier scores
[ ] Sparse inputs
[x] Fuzzy clustering

cjnolet commented 3 years ago

Linking https://github.com/rapidsai/cuml/issues/3997

cjnolet commented 3 years ago

HDBSCAN has officially moved out of experimental!

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

KukumavMozolo commented 4 months ago

Hi there, I would be very interested in sparse inputs being supported. Is this planned?

beckernick commented 3 months ago

Hi @KukumavMozolo, thanks for reviving this issue! Would you be able to share any info about what kinds of use cases this might enable for you that don't currently work well with dense inputs?

KukumavMozolo commented 3 months ago

Hi @beckernick, currently I am working on crash report deduplication. Essentially, this entails transforming various device metrics such as memory consumption and CPU utilization, but also crash logs such as call traces and register contents, into a very high-dimensional vector space that is also very sparse. Crash report deduplication is then the process of finding clusters in that vector space that represent sources of errors. Obtaining a dense representation of this kind of data seems difficult to me, since a slight variation in the input (e.g. a slightly different stack trace) can indicate a qualitatively different source of error, which makes the data hard to compress. Still, the number of possible errors at a given point in time is comparatively small.
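
To make the sparsity concrete, here is a minimal sketch of a bag-of-frames encoding. The frame names, the three reports, and the `encode_report` helper are all hypothetical and are not taken from the actual pipeline; it only illustrates why the number of columns grows with every distinct frame while each row stays mostly empty.

```python
from collections import Counter


def encode_report(frames, vocabulary):
    """Encode one crash report as a sparse {column_index: count} dict.

    `vocabulary` maps frame name -> column index and grows whenever a new
    frame name is seen, so the dimensionality equals the number of
    distinct frames across all reports encoded so far.
    """
    counts = Counter(frames)
    return {
        vocabulary.setdefault(name, len(vocabulary)): n
        for name, n in counts.items()
    }


# Hypothetical call traces; real reports would also carry device metrics.
reports = [
    ["main", "parse_config", "read_file", "memcpy"],
    ["main", "parse_config", "read_file", "strlen"],
    ["main", "render", "draw_frame", "gpu_submit"],
]

vocabulary = {}
rows = [encode_report(frames, vocabulary) for frames in reports]

n_cols = len(vocabulary)                   # distinct frames seen so far
nonzeros = sum(len(row) for row in rows)   # stored entries across all rows
density = nonzeros / (len(rows) * n_cols)  # fraction of the matrix that is non-zero

print(f"{len(rows)} reports x {n_cols} columns, density = {density:.2f}")
```

With realistic data the vocabulary would be far larger than any single trace, so the density would fall well below what this toy example shows, which is where a dense input format becomes impractical.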
