scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.78k stars 497 forks

Different exemplars from same clusters with Numpy 2 on different platforms #655

Open changhsinlee opened 4 weeks ago

changhsinlee commented 4 weeks ago

What

I found that after upgrading from numpy 1 to numpy 2, the clustering results differ across platforms. This didn't happen on numpy 1. I also tried setting numpy seeds and PYTHONHASHSEED, and neither helped.

How to reproduce

poetry dependency:

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.12"
pandas = "^2.2.2"
numpy = "^1.26.4"
hdbscan = ">=0.8.38"
scikit-learn = "^1.5.1"

The issue appeared when I upgraded from numpy 1.26.4 to numpy 2.1.1 while keeping all other packages the same.

You can reproduce it with this data: read it into a dataframe, then run HDBSCAN.fit(df) with cluster_selection_epsilon = 0.15 plus the parameters in the JSON file.

data.json

The platform name is printed with platform.platform()
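
The reproduction steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script used: the layout of data.json is an assumption (the hypothetical keys "data" and "params" stand in for however the file actually stores the points and the HDBSCAN parameters), and the helper names exemplar_fingerprint and run_repro are made up for illustration. It prints a short hash of the labels and of the exemplars so the output from two platforms can be compared line by line.

```python
import hashlib
import json
import platform


def exemplar_fingerprint(exemplars, decimals=8):
    """Hash a list of per-cluster exemplar arrays into a short hex digest."""
    rounded = []
    for cluster in exemplars:
        # Accept NumPy arrays (via .tolist()) as well as plain nested lists.
        rows = cluster.tolist() if hasattr(cluster, "tolist") else list(cluster)
        # Sort rows so only the set of exemplar points matters, not their order.
        rounded.append(sorted([round(float(v), decimals) for v in row] for row in rows))
    payload = json.dumps(rounded).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def run_repro(path="data.json"):
    # Imported here so the helper above stays usable without these packages.
    import numpy as np
    import pandas as pd
    import hdbscan

    with open(path) as f:
        payload = json.load(f)

    # ASSUMPTION: hypothetical layout {"data": [...], "params": {...}};
    # adjust to match the real structure of data.json.
    df = pd.DataFrame(payload["data"])
    params = payload["params"]

    clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=0.15, **params)
    clusterer.fit(df)

    labels_json = json.dumps(clusterer.labels_.tolist()).encode("utf-8")
    print("platform: ", platform.platform())
    print("numpy:    ", np.__version__)
    print("labels:   ", hashlib.sha256(labels_json).hexdigest()[:16])
    print("exemplars:", exemplar_fingerprint(clusterer.exemplars_))
```

Calling run_repro() on each platform and comparing the printed lines should show whether the labels hash matches while the exemplars hash does not. The rounding to 8 decimals is an arbitrary choice to ignore tiny floating-point noise; lowering or raising it changes how strict the comparison is.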

Both returned the same clusters -- only the exemplars are different. Also, on numpy 1 they returned the same exemplars.