scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

HDBSCAN flat returns more than `n_clusters` #561

Open KKJSP opened 2 years ago

KKJSP commented 2 years ago

Observed: According to its documentation, the HDBSCAN flat module is supposed to return a fixed number of clusters, controlled by the n_clusters parameter. I came across a sample for which it returns more clusters than requested.

Expected: HDBSCAN flat must return exactly n_clusters clusters for all inputs.

Code and data: Here is a simple dataset for which HDBSCAN flat returns more than n_clusters clusters: data.csv

Here is the code:

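import pandas as pd
from hdbscan import flat

# Load the attached sample dataset
df = pd.read_csv("data.csv")
# Request exactly 3 flat clusters
clustering = flat.HDBSCAN_flat(df, min_samples=2, min_cluster_size=2, n_clusters=3)
# Distinct labels assigned to the points (-1 marks noise)
print(set(clustering.labels_))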

This prints {0, 1, 2, 3, -1}, i.e. four clusters (0, 1, 2, and 3) plus the noise label -1, even though n_clusters=3 was requested.
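
For anyone reproducing this, a small check along the following lines may make the mismatch easier to spot. It is only a sketch: count_clusters is a hypothetical helper (not part of hdbscan), and the label list is hard-coded to match the set printed above rather than computed from data.csv.

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the noise label.

    Hypothetical helper for illustration; not part of the hdbscan API.
    """
    return len({int(label) for label in labels if label != noise_label})


# Assumed stand-in for clustering.labels_, matching the set reported above.
labels = [0, 1, 2, 3, -1, 0, 2]
requested = 3

found = count_clusters(labels)
print(f"requested={requested}, found={found}")
if found != requested:
    print("Mismatch: more or fewer clusters than n_clusters")
```

With the real data, labels would be replaced by clustering.labels_ from the snippet above.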

traderjoesbrownielover commented 2 years ago

I would be willing to take this on. Can this be assigned to me please?