scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

add HDBSCAN #14331

Closed amueller closed 1 year ago

amueller commented 5 years ago

I think we should add HDBSCAN. The original paper is from 2013 and has 300 citations; @lmcinnes's accelerated version is from 2017, and the 2017 JOSS paper about the implementation has 100. I think that should fulfill our inclusion requirements, and it's commonly asked for.

@lmcinnes said he might not have time to move it so maybe someone else can pick it up.

For reference: https://github.com/scikit-learn-contrib/hdbscan

lmcinnes commented 5 years ago

I will be happy to provide assistance with moving it over -- there are some changes that will be required, mostly related to differences in how the internals of scikit-learn's KD-trees are accessed via Cython. I will also be happy to help with reviewing.

jnothman commented 5 years ago

Should the release of OPTICS play into that decision?

amueller commented 5 years ago

I'm really not that familiar with OPTICS. Looks like it might make the OPTICS implementation obsolete? https://datascience.stackexchange.com/a/11630

amueller commented 5 years ago

Btw I like the demo dataset for hdbscan, maybe it could replace some of the other ones we have in the comparison? https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

amueller commented 5 years ago

@jnothman I know you're catching up with a lot, but maybe this is worth looking into, given that there are still issues with OPTICS and this is actually a pretty well-tested implementation?

rth commented 5 years ago

> Looks like it might make the OPTICS implementation obsolete? [..] given that there are still issues with OPTICS

OPTICS was announced as a "major feature" in v0.21, so I guess it's now there to stay in any case? The 1999 paper also has 3.5k citations. Unless there are significant issues with it? I haven't found that many on the issue tracker, but I haven't followed the development either.

I think it would be good to include HDBSCAN. I'm just saying that, purely following the inclusion criteria (independently of any technical merits of the algorithms), it made sense to include OPTICS first. What impact that may have on a future HDBSCAN inclusion, I'm not sure.

amueller commented 5 years ago

@rth not sure I follow your logic. Are you talking about the class or the implementation or both? I am not very familiar with either algorithm, but it looks to me as if an implementation of HDBSCAN would also implement OPTICS, and having a redundant implementation of OPTICS seems unnecessary?

rth commented 5 years ago

I meant the OPTICS algorithm, not so much the implementation. I was not aware that OPTICS results could be obtained exactly with HDBSCAN. As long as we don't break backward compatibility of OPTICS I don't really have an opinion, and will let people who have worked on this decide.

adrinjalali commented 5 years ago

I haven't read the HDBSCAN paper in detail, but as I understand it, HDBSCAN is not strictly a superset of OPTICS. Still, it seems the community has accepted that it's the better of the two.

I don't think it'd be too hard to refactor the code so that both algorithms can use the core part.

lmcinnes commented 5 years ago

HDBSCAN and OPTICS share the same computational core (though HDBSCAN is a little more general); the post-processing can be a little different. I do think you want to look to re-use/integrate the core code if possible to improve stability, debugging, and maintenance.
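To make the shared core a bit more concrete, here is a minimal pure-Python sketch of the two distance notions involved. It is not taken from either implementation: the function names are hypothetical, the metric is assumed to be Euclidean, and the point itself is assumed to count as its own first neighbour when computing the core distance (conventions differ on this). OPTICS uses an asymmetric reachability distance, while HDBSCAN uses the symmetric mutual reachability distance; both are built on the same core distances.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbour.

    Assumption: the point itself is counted as its own first neighbour.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_samples - 1])
    return cores


def optics_reachability(points, cores, o, p):
    """OPTICS-style reachability of point p from point o (asymmetric)."""
    return max(cores[o], math.dist(points[o], points[p]))


def mutual_reachability(points, cores, a, b):
    """HDBSCAN-style mutual reachability between a and b (symmetric)."""
    return max(cores[a], cores[b], math.dist(points[a], points[b]))


# Toy data: four points in a dense run plus one far-away point.
points = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 8)]
cores = core_distances(points, min_samples=3)

reach_b_from_a = optics_reachability(points, cores, o=0, p=1)
reach_a_from_b = optics_reachability(points, cores, o=1, p=0)
mreach_ab = mutual_reachability(points, cores, 0, 1)
```

The expensive part in practice is the neighbour search behind `core_distances`, which is where a shared tree-based implementation would pay off; the quadratic loop here is only for illustration.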

Micky774 commented 1 year ago

Finally closed in https://github.com/scikit-learn/scikit-learn/pull/26385 😄