scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.72k stars 491 forks

Non-deterministic behavior #409

Open alexgcsa opened 3 years ago

alexgcsa commented 3 years ago

Hi there,

I was wondering whether HDBSCAN is deterministic. If it is not, it would be useful to add a random seed parameter to initialize and control the generation of pseudo-random numbers during its process.

Could you clarify it?

Cheers,

Alex de Sá

chronchi commented 3 years ago

Hi Alex, you can take a look at this page where they go over how hdbscan works: https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html

Based on the paper as well as the page above, one can say the algorithm is deterministic.

warrior-galaxy commented 3 years ago

I believe that the results might look different due to the different labelling order for the clusters when you apply pairwise distance.
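To check whether two runs differ only in how the clusters are numbered, the labelings can be compared as partitions rather than element by element. The sketch below is a plain-Python illustration; `same_partition` is a hypothetical helper, not part of hdbscan. It relies on the hdbscan convention that -1 marks noise, so -1 is only allowed to match -1.

```python
def same_partition(labels_a, labels_b):
    """Return True if two label sequences group the points identically,
    ignoring which integer names each cluster. The label -1 (noise in
    hdbscan) must line up with -1 in the other run."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(labels_a, labels_b):
        # A noise point in one run must be a noise point in the other.
        if (a == -1) != (b == -1):
            return False
        # Each label in one run must map to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


run_1 = [0, 0, 1, 1, 2]
run_2 = [2, 2, 0, 0, 1]  # same groups, different cluster ids
run_3 = [0, 1, 1, 1, 2]  # a genuinely different grouping

print(same_partition(run_1, run_2))
print(same_partition(run_1, run_3))
```

If this returns True for two runs whose raw label arrays differ, the difference is only in the labelling order and not in the clustering itself.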

azharjuman commented 3 years ago

If you read how the algorithm works (How HDBSCAN Works), there is a step that generates a minimum spanning tree, and I believe this might lead to non-deterministic behaviour, since a unique MST cannot be guaranteed for a graph with non-unique edge weights.
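The point about tied edge weights can be seen on a tiny graph. The sketch below uses a textbook Kruskal's algorithm written for illustration only; it is not the MST routine hdbscan uses, and it makes no claim about whether a different tree would change the final clusters. It only shows that when weights tie, the tree you get depends on the order in which edges are visited.

```python
def kruskal(num_nodes, edges):
    """Kruskal's algorithm; ties are broken by the input order of `edges`."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # sorted() is stable, so equal-weight edges keep their input order.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


# A triangle in which every edge has the same weight.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]

tree_a = kruskal(3, edges)
tree_b = kruskal(3, list(reversed(edges)))

print(tree_a)
print(tree_b)
print(sum(w for _, _, w in tree_a) == sum(w for _, _, w in tree_b))
```

Both results are valid minimum spanning trees with the same total weight, but they contain different edges.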

landEpita commented 1 year ago

Why do all the clusters change when I add just one point? Without the added point, the clusters stay the same, but as soon as I add just one point, all the clusters change.

KartikKannapur commented 2 months ago

I have had deterministic results with the following:

  1. Seed NumPy's global random number generator:
       import numpy as np
       np.random.seed(42)
  2. For HDBSCAN, set gen_min_span_tree=False and approx_min_span_tree=False.
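One way to test whether these settings give repeatable results on your own data is to run the clustering several times and compare the labels. The helper below is a minimal sketch; `is_repeatable` and `threshold_clusterer` are hypothetical names introduced for illustration. The hdbscan call appears only in a comment and is not executed here, so treat it as an assumption about how you would plug the library in.

```python
def is_repeatable(cluster_fn, data, runs=3):
    """Call cluster_fn(data) `runs` times and report whether every run
    returns exactly the same label sequence as the first."""
    first = list(cluster_fn(data))
    return all(list(cluster_fn(data)) == first for _ in range(runs - 1))


def threshold_clusterer(data):
    """Stand-in clusterer so the sketch runs without hdbscan installed."""
    return [0 if x < 0.5 else 1 for x in data]


print(is_repeatable(threshold_clusterer, [0.1, 0.2, 0.8, 0.9]))

# Hypothetical usage with the hdbscan package (not executed here):
#   import numpy as np
#   import hdbscan
#   np.random.seed(42)
#   X = np.random.rand(200, 2)
#   fit = lambda data: hdbscan.HDBSCAN(
#       gen_min_span_tree=False, approx_min_span_tree=False
#   ).fit_predict(data)
#   print(is_repeatable(fit, X))
```

This compares raw labels, so a run that only renumbers the clusters would count as different.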