Hi, I am using HDBSCAN to cluster text embeddings.
As the data is unbalanced in favor of one category of embeddings, I am obtaining too many sub-clusters of that category, which I would like to merge. I have found that data points with a cosine distance below 0.7 should belong to the same cluster, and if I understand correctly I should set cluster_selection_epsilon=0.7 to achieve this.
This doesn't seem to be working, as all the data points end up in the same cluster (is the value too high?).
My current code:
cluster_labels:
cosine_dist:
Is this the correct use of cluster_selection_epsilon? Thanks