scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

How to set cluster_selection_epsilon when using cosine distances? #627

Open ma9o opened 9 months ago

ma9o commented 9 months ago

Hi, I am using HDBSCAN to cluster text embeddings.

As the data is unbalanced in favor of one category of embeddings, I get too many sub-clusters of that category, which I would like to merge. I have found that datapoints with a cosine distance below 0.7 should belong to the same cluster, and if I understand correctly I should set cluster_selection_epsilon=0.7 to achieve this.

This doesn't seem to work: all the datapoints end up in the same cluster (is the value too high?).

My current code:

import numpy as np
import cupy as cp
import cuml
from cuml.metrics import pairwise_distances
from hdbscan import HDBSCAN

# Move the embeddings to the GPU
embeddings_gpu = cp.asarray(embeddings)

# Reduce dimensionality with UMAP, using cosine distance
umap_model = cuml.UMAP(n_neighbors=15,
                       n_components=100,
                       metric='cosine')
reduced_data_gpu = umap_model.fit_transform(embeddings_gpu)

# Pairwise cosine distances between the reduced points
cosine_dist = pairwise_distances(reduced_data_gpu, metric='cosine')

clusterer = HDBSCAN(min_cluster_size=5,
                    gen_min_span_tree=True,
                    metric="precomputed",
                    cluster_selection_epsilon=0.7)
# Convert to float64 and copy back to host memory before clustering
cluster_labels = clusterer.fit_predict(cosine_dist.astype(np.float64).get())

cluster_labels:

Shape: 9533
array([0, 0, 0, ..., 0, 0, 0])

cosine_dist:

Shape: (9533, 9533)
array([[5.9604645e-07, 1.6956329e-02, 5.4422319e-02, ..., 1.0555809e+00,
        1.1026136e+00, 1.3615031e+00],
       ...,
       [1.3615031e+00, 1.4514638e+00, 1.3940278e+00, ..., 3.1383842e-01,
        7.0653200e-02, 5.9604645e-07]], dtype=float32) 

Is this the correct use of cluster_selection_epsilon? Thanks
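
To probe the "value is too high?" guess, here is a minimal sketch of a sanity check. It is not the hdbscan library's implementation. It assumes the threshold acts on mutual reachability distances in a linkage-style hierarchy rather than on every pair of points, so points can be chained together through intermediate neighbours. The helper name `components_at_epsilon` and the toy `chain` matrix are hypothetical and only for illustration; the double loop is pure Python, so on real data it is only practical on a subsample of the distance matrix.

```python
# Illustrative sketch only: a rough proxy, not what hdbscan does internally.
# `components_at_epsilon` is a hypothetical helper name.

def components_at_epsilon(dist, epsilon, min_samples=5):
    """Count groups of points linked by mutual reachability <= epsilon.

    dist is a square, symmetric distance matrix (list of lists).
    """
    n = len(dist)
    # Core distance: distance to the min_samples-th nearest neighbour,
    # not counting the point itself (its near-zero self-distance sorts first).
    k = min(min_samples, n - 1)
    core = [sorted(row)[k] for row in dist]

    # Union-find over points, with path halving in find().
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Mutual reachability distance between points i and j.
            mreach = max(core[i], core[j], dist[i][j])
            if mreach <= epsilon:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})


# Toy example: four points on a line, 0.5 apart. The two ends are 1.5 apart,
# yet they are linked through their neighbours at epsilon = 0.7.
positions = (0.0, 0.5, 1.0, 1.5)
chain = [[abs(a - b) for b in positions] for a in positions]
print(components_at_epsilon(chain, 0.7, min_samples=1))  # 1
print(components_at_epsilon(chain, 0.4, min_samples=1))  # 4
```

If a subsample of cosine_dist collapses into a single group at 0.7 but splits into several groups at smaller values, that would be consistent with the threshold being too high for this data, even though many individual pairs are further apart than 0.7.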