scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Wrong exemplars returned when using cluster_selection_epsilon (exemplars from eps=0 are returned) #593

Open lucetka opened 1 year ago

lucetka commented 1 year ago

When a model is fitted with cluster_selection_epsilon within its effective range, the exemplars returned appear to be wrong: they are the exemplars of the clusters produced before the epsilon is applied, i.e. the eps=0 clusters.

I think this is related to another issue I asked about, https://github.com/scikit-learn-contrib/hdbscan/issues/571, i.e. that the condensed tree returned is always the eps=0 tree. It shows neither the new "superclusters" selected as a consequence of merging clusters nor the points falling out at the specified eps level. I've also noticed that others have identified related issues: https://github.com/scikit-learn-contrib/hdbscan/pull/586. It would be great if this could be fixed.

Meanwhile, as a quick and dirty workaround that is sufficient for my specific use, I map the labels from the clustering with epsilon to the labels from the clustering without it. For each newly emerged supercluster I simply use the exemplars of all the eps=0 clusters that the supercluster engulfed. For example, instead of 3 exemplars I may end up with 6, which in my case (clustering documents) is not necessarily a bad thing, as it also gives an idea of the heterogeneity of the final cluster.

However, I know this is not really correct, because the resulting supercluster consists of more than just the engulfed clusters selected in the eps=0 clustering. It also absorbs all the points that were discarded as noise at every split above the applied eps level. These points (noise in the eps=0 clustering, but cluster members once eps is applied) are not represented by the exemplars.
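In case it helps anyone, here is a rough sketch of that label mapping. The helper name `merge_exemplars` and its arguments are hypothetical, not part of hdbscan. It assumes two fits of the same data (one with eps=0, one with eps applied), that noise is labelled -1, and that the eps=0 exemplars are indexable by cluster label. It inherits the limitation described above: points that were noise at eps=0 are not represented.

```python
from collections import Counter, defaultdict


def merge_exemplars(labels_eps0, labels_eps, exemplars_eps0):
    """Pool eps=0 exemplars into the clusters of the eps>0 clustering.

    Hypothetical helper (not an hdbscan API).
    labels_eps0    -- per-point labels from the fit without epsilon
    labels_eps     -- per-point labels from the fit with epsilon
    exemplars_eps0 -- sequence where item i holds the exemplars of eps=0 cluster i
    """
    # For each eps=0 cluster, count which eps>0 labels its points received.
    votes = defaultdict(Counter)
    for old, new in zip(labels_eps0, labels_eps):
        if old != -1 and new != -1:  # skip noise in either clustering
            votes[old][new] += 1

    # Assign each eps=0 cluster to the eps>0 cluster holding most of its points,
    # then concatenate the exemplars of everything that landed in the same one.
    merged = defaultdict(list)
    for old, counter in sorted(votes.items()):
        new = counter.most_common(1)[0][0]
        merged[new].extend(exemplars_eps0[old])
    return dict(merged)


# Toy data: eps=0 clusters 0 and 1 are engulfed by supercluster 0;
# the point at index 4 was noise at eps=0 and so contributes no exemplar.
labels_eps0 = [0, 0, 1, 1, -1, 2, 2]
labels_eps = [0, 0, 0, 0, 0, 1, 1]
exemplars_eps0 = [["a"], ["b", "c"], ["d"]]
result = merge_exemplars(labels_eps0, labels_eps, exemplars_eps0)
```

With real models the inputs would be the `labels_` of the two fits and the `exemplars_` of the eps=0 fit.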

Edit: I realize I should have mentioned that this is with hdbscan 0.8.28 and Python 3.10.2 on Windows 10 64-bit.