rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[TASK] Post HDBSCAN merge tasks #3879

Open divyegala opened 3 years ago

divyegala commented 3 years ago

These are the tasks for cuML's HDBSCAN implementation following the 21.06 release:

1. Necessary tech debt / cleanup (i.e. need to have)

2. Testing / Correctness verification

3. Test failures / bugs (i.e. must have)

4. Additional tech debt / cleanup (i.e. nice to have)

5. External

6. Additional features before blog

7. Additional features (i.e. like to have)

cjnolet commented 3 years ago

Linking https://github.com/rapidsai/cuml/issues/3997

cjnolet commented 3 years ago

HDBSCAN has officially moved out of experimental!

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

KukumavMozolo commented 4 months ago

Hi there, I would be very interested in sparse inputs being supported. Is this planned?

beckernick commented 3 months ago

Hi @KukumavMozolo, thanks for reviving this issue! Would you be able to share any info about what kinds of use cases this might enable for you that don't currently work well with dense inputs?

KukumavMozolo commented 3 months ago

Hi @beckernick, I am currently working on crash report deduplication. Essentially, this entails transforming various device metrics (memory consumption, CPU utilization) as well as crash logs (call traces, register contents) into a very high-dimensional vector space that is also very sparse. Crash report deduplication is then the process of finding clusters in that vector space, where each cluster represents a source of errors. Obtaining a dense representation of this kind of data seems difficult to me: a slight variation in the input, e.g. a slightly different stack trace, can indicate a qualitatively different source of error, which makes the data hard to compress. Still, the number of possible errors at a given point in time is comparatively small.
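To make the shape of such data concrete, here is a minimal sketch of how stack traces might be turned into a sparse matrix with the hashing trick, and how its memory footprint compares with a dense array of the same shape. Everything in it is an assumption for illustration: the feature-space size `N_FEATURES`, the helpers `frame_to_column` and `vectorize`, and the toy reports are hypothetical and not taken from the actual pipeline. It uses only NumPy and SciPy and does not call cuML.

```python
import zlib

import numpy as np
from scipy import sparse

# Hypothetical size of the hashed feature space (an assumption for this sketch).
N_FEATURES = 2 ** 18


def frame_to_column(frame: str) -> int:
    """Map a stack frame name to a column index using a stable hash."""
    return zlib.crc32(frame.encode("utf-8")) % N_FEATURES


def vectorize(reports):
    """Build a CSR matrix with one row per crash report.

    Each report is a list of stack frame names; each frame contributes
    a count of 1 to the column its name hashes to.
    """
    rows, cols, vals = [], [], []
    for i, frames in enumerate(reports):
        for frame in frames:
            rows.append(i)
            cols.append(frame_to_column(frame))
            vals.append(1.0)
    coo = sparse.coo_matrix(
        (vals, (rows, cols)),
        shape=(len(reports), N_FEATURES),
        dtype=np.float32,
    )
    # Converting COO to CSR sums any duplicate (row, column) entries.
    return coo.tocsr()


# Toy data: the first two reports share a call trace, the third differs.
reports = [
    ["main", "parse_config", "read_file", "memcpy"],
    ["main", "parse_config", "read_file", "memcpy"],
    ["main", "render", "draw_glyph", "free"],
]

X = vectorize(reports)

# Memory held by the three CSR arrays versus a dense float32 array.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print("stored non-zeros:", X.nnz)
print("sparse bytes:", sparse_bytes)
print("dense bytes:", dense_bytes)
```

The dense footprint grows with rows times features regardless of content, while the sparse footprint grows with the number of non-zero entries, which is why densifying this kind of matrix before clustering becomes impractical at realistic sizes.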