scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Incremental clustering #76

Open Navein opened 7 years ago

Navein commented 7 years ago

Will there be incremental learning for HDBSCAN in the near future? Seems like a really good feature to have. Currently I'm using incremental hierarchical clustering and I wonder if this can also be done using HDBSCAN.

lmcinnes commented 7 years ago

Yes and no. I am currently working on a prediction approach which can assign new points to the existing clusters. There is an alternative approach that can approximate a full update that will generate new clusters; unfortunately, this is computationally fairly expensive -- in practice it is actually cheaper to re-cluster from scratch with the new points (in part due to the fact that clustering is fairly cheap). So, depending on which you actually need, you may or may not be getting what you want in a while. What was your actual use case?
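To make "assigning new points to the existing clusters" concrete, here is a rough stdlib-only sketch. It is a plain nearest-neighbour stand-in, not the prediction approach being developed for hdbscan; the function name `assign_to_existing` and the `max_dist` cutoff are assumptions made up for illustration.

```python
import math

def assign_to_existing(points, labels, new_points, max_dist):
    """Give each new point the label of its nearest clustered point.

    Hypothetical helper: new points farther than max_dist from every
    clustered point are labelled -1 (noise). Existing labels never change.
    """
    assigned = []
    for q in new_points:
        best_label, best_dist = -1, max_dist
        for p, label in zip(points, labels):
            if label == -1:
                continue  # existing noise points cannot attract new points
            d = math.dist(p, q)
            if d <= best_dist:
                best_label, best_dist = label, d
        assigned.append(best_label)
    return assigned

# Two well-separated clusters, already labelled by some earlier clustering run.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 0, 1, 1]

new_labels = assign_to_existing(
    points, labels, [(0.2, 0.1), (5.0, 4.8), (2.5, 2.5)], max_dist=1.0
)
print(new_labels)  # [0, 1, -1]
```

Note what this cannot do: a new point can only join a cluster or become noise, so no new cluster ever forms, which is the limitation described above.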

Navein commented 7 years ago

"I am currently working on a prediction approach which can assign new points to the existing clusters." This would be a really nice feature, and it is something that I want for my current use case as well, which is clustering malware behaviors. I am looking for something which can add new clusters incrementally without changing the current clusters. What is the alternative approach that can approximate a full update?

lmcinnes commented 7 years ago

Incrementally adding new clusters without changing the current clusters is ... not really possible by any means that I know of. That would require significantly more thought. The catch is that new points can (quite reasonably) potentially change the existing clustering (enough new points might connect two previously separate clusters, for example, or add two sufficiently dense regions within an existing cluster that it splits in two). If you allow new clusters to form then you should do that globally (otherwise you are just re-clustering noise points repetitively -- which I guess one could do, but I would advise against it) and that can ultimately change the clusters you have.
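The "new points might connect two previously separate clusters" case can be shown with a toy example. This sketch uses fixed-threshold single linkage (connected components under a distance cutoff) as a deliberately simplified stand-in for HDBSCAN; `threshold_clusters` and `eps` are hypothetical names, and the splitting case is not shown.

```python
import math

def threshold_clusters(points, eps):
    """Label connected components of the graph linking points within eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    # Renumber roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

old = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
before = threshold_clusters(old, eps=1.6)
after = threshold_clusters(old + [(2.5, 0.0)], eps=1.6)

print(before)  # [0, 0, 1, 1]
print(after)   # [0, 0, 0, 0, 0]
```

A single bridging point at (2.5, 0.0) is enough to merge the two groups here, so the old labels cannot be kept fixed without ignoring what the new data says.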

If you simply wish for less dramatic changes upon re-clustering with new points added then I would suggest you take the condensed tree as the result of clustering rather than the final flat clustering that hdbscan extracts. That should have smaller changes which you can map between in a sensible way.
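One possible way to "map between" two clustering runs is to match clusters by how many of the original points they share. The sketch below is an assumption on my part rather than anything the library provides: it scores overlap with the Jaccard index, and `map_clusters` and `min_jaccard` are hypothetical names. It takes flat label lists for simplicity, but the same set comparison could be applied to the point sets of condensed tree nodes.

```python
from collections import defaultdict

def map_clusters(old_labels, new_labels, min_jaccard=0.5):
    """Map each old cluster id to the best-overlapping new cluster id.

    Assumes the old points are the first len(old_labels) entries of the
    new data. Label -1 means noise. An old cluster with no match at or
    above min_jaccard maps to None.
    """
    old_sets, new_sets = defaultdict(set), defaultdict(set)
    for i, label in enumerate(old_labels):
        if label != -1:
            old_sets[label].add(i)
    for i, label in enumerate(new_labels[:len(old_labels)]):
        if label != -1:
            new_sets[label].add(i)

    mapping = {}
    for old_id, old_members in old_sets.items():
        best_id, best_score = None, min_jaccard
        for new_id, new_members in new_sets.items():
            shared = len(old_members & new_members)
            score = shared / len(old_members | new_members)
            if score >= best_score:
                best_id, best_score = new_id, score
        mapping[old_id] = best_id
    return mapping

old_labels = [0, 0, 0, 1, 1, -1]        # first run, 6 points
new_labels = [2, 2, 2, 0, 0, 0, 1, 1]   # second run, same 6 points plus 2 new
mapping = map_clusters(old_labels, new_labels)
print(mapping)  # {0: 2, 1: 0}
```

Cluster ids are arbitrary between runs, so a mapping like this is what lets you say "old cluster 1 is now cluster 0" even though the numbers changed.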

gtfuhr commented 1 year ago

Hello @lmcinnes, since the last answer was from 2016, do you have any new suggestions on this topic?

cat-cache commented 2 months ago

For anyone looking for something similar, I found this: https://arxiv.org/abs/1910.07283