Open divyegala opened 3 years ago
HDBSCAN has officially moved out of experimental!
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Hi there, I would be very interested in sparse inputs being supported. Is this planned?
Hi @KukumavMozolo, thanks for reviving this issue! Would you be able to share any info about what kinds of use cases this might enable for you that don't currently work well with dense inputs?
Hi @beckernick, I am currently working on crash report deduplication. Essentially, this entails transforming various device metrics, such as memory consumption and CPU utilization, as well as crash logs, such as call traces and register contents, into a very high-dimensional vector space that is also very sparse. Crash report deduplication is then the process of finding clusters in that vector space that represent sources of errors. Obtaining a dense representation of this kind of data seems difficult to me, since a slight variation in the input, e.g. a slightly different stack trace, can indicate a qualitatively different source of error, which makes the data hard to compress. Still, the number of possible errors at a given point in time is comparatively small.
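To make the sparsity concrete, here is a minimal sketch, assuming stack frames are hashed into a large fixed-size feature space (the hashing trick). The frame names, the `DIM` size, and both helper functions are hypothetical and are not part of cuML. The sketch only illustrates why such vectors are almost entirely zeros, and why two reports that differ by a single frame end up close but not identical.

```python
import hashlib
import math

DIM = 2**20  # hypothetical feature-space size; a real pipeline may differ


def hash_features(tokens, dim=DIM):
    """Map tokens to a sparse vector stored as {index: count}."""
    vec = {}
    for tok in tokens:
        # Stable hash so the same token always lands on the same index.
        idx = int(hashlib.sha256(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] = vec.get(idx, 0) + 1
    return vec


def cosine_distance(a, b):
    """Cosine distance between two sparse {index: value} vectors."""
    dot = sum(v * b.get(i, 0) for i, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Two made-up call traces that differ in a single frame.
trace_a = ["main", "parse_config", "load_plugin", "memcpy"]
trace_b = ["main", "parse_config", "load_driver", "memcpy"]

vec_a = hash_features(trace_a)
vec_b = hash_features(trace_b)

density = len(vec_a) / DIM
print(f"non-zero fraction: {density:.2e}")
print(f"cosine distance:   {cosine_distance(vec_a, vec_b):.3f}")
```

Storing only the non-zero entries keeps memory proportional to the number of observed tokens rather than to `DIM`, which is the property a dense input format gives up.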
These are tasks for cuML's HDBSCAN implementation after the 21.06 release
1. Necessary tech debt / cleanup (e.g. need to have)
2. Testing / Correctness verification
   - cluster_selection_method=eom for empty cluster tree and allow_single_cluster=True
   - cluster_selection_method=leaf and cluster_selection_epsilon != 0.0
   - with the help of above (#4009)
3. Test failures / bugs (e.g. must have)
4. Additional tech debt / cleanup (e.g. nice to have)
   - int instead of bool due to inter-op issues between host and device bool. Update these
   - do_labelling()
5. External
6. Additional features before blog
7. Additional features (e.g. like to have)
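
The two parameter combinations called out under item 2 could be collected into a small test matrix. This is a minimal sketch, not the actual cuML test suite: the `CASES` list, the `describe` helper, and the concrete epsilon value are assumptions for illustration, while `cluster_selection_method`, `allow_single_cluster`, and `cluster_selection_epsilon` are the parameter names quoted in the list.

```python
# Hypothetical test matrix for the two cases listed under item 2.
CASES = [
    # EOM selection on an empty cluster tree with a single cluster allowed.
    {"cluster_selection_method": "eom", "allow_single_cluster": True},
    # Leaf selection with a non-zero epsilon (0.5 is an arbitrary choice).
    {"cluster_selection_method": "leaf", "cluster_selection_epsilon": 0.5},
]


def describe(params):
    """Render a parameter dict as a readable test id."""
    return ", ".join(f"{k}={v}" for k, v in sorted(params.items()))


for params in CASES:
    # A real test would pass **params to an HDBSCAN estimator and compare
    # the resulting labels against a reference implementation.
    print(describe(params))
```

Each dictionary is meant to be unpacked as keyword arguments into an HDBSCAN estimator that accepts these parameter names, so new cases can be added without touching the test body.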