This PR fixes an issue where the prediction data does not take cluster_selection_epsilon into account. The bug surfaces as wrong predictions from approximate_predict and incorrect exemplars_.
Code to reproduce the problem:
import hdbscan
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(100, n_features=8, centers=10, random_state=42)

# use a high epsilon to force fewer clusters; with real-world data this happens more easily
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=12.0, prediction_data=True)
clusterer.fit(blobs)

# 7 clusters according to the labels
print(clusterer.labels_.max() + 1)

# but 10 clusters according to the exemplars
print(len(clusterer.exemplars_))

# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
print(clusterer.labels_[:10])

# predicting assigns points to completely different clusters (and a different number of clusters!)
# [6, 5, 4, 0, 6, 6, 9, 0, 6, 2]
predicted_labels, _ = hdbscan.approximate_predict(clusterer, blobs[:10])
print(predicted_labels)
I tracked the issue down to the prediction data selecting clusters from the tree differently from how it's done in _hdbscan_tree.pyx. The fix is to return the selected clusters from get_clusters in _hdbscan_tree.pyx and use the same clusters for prediction.
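A quick way to sanity-check the change (a sketch for verification, not output from this PR): once fitting and prediction share the same selected clusters, the exemplar count should match the number of labels, and approximate_predict on the training points should broadly agree with labels_. The exact agreement figure will depend on the data and on the approximate-prediction heuristics.

import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(100, n_features=8, centers=10, random_state=42)
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=12.0, prediction_data=True)
clusterer.fit(blobs)

n_clusters = clusterer.labels_.max() + 1

# the exemplars should describe the same clusters the labels use
print(n_clusters, len(clusterer.exemplars_))

# predicting the training points should broadly agree with the fitted labels
predicted, _ = hdbscan.approximate_predict(clusterer, blobs)
agreement = np.mean(predicted == clusterer.labels_)
print(f"agreement with labels_: {agreement:.2%}")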
This likely fixes #308