What

After upgrading from numpy 1 to numpy 2, the clustering results differ between platforms. This did not happen on numpy 1. I also tried setting the numpy seed and PYTHONHASHSEED, and neither helped.

How to reproduce
poetry dependency:

The issue appeared when I upgraded from numpy 1.26.4 to numpy 2.1.1 while keeping all other packages the same.

You can reproduce it with this data by reading it into a dataframe and then running HDBSCAN.fit(df) with cluster_selection_epsilon = 0.15 plus the parameters in the json file.

data.json
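The steps above can be sketched roughly as follows. This is a minimal sketch rather than the exact script used for the report: it assumes the hdbscan package (the one that exposes exemplars_), and the helper names build_hdbscan_kwargs and fit_and_get_exemplars are hypothetical. The layout of data.json is not spelled out here, so loading the dataframe and the parameter dict is left to the caller.

```python
from typing import Any, Mapping


def build_hdbscan_kwargs(json_params: Mapping[str, Any]) -> dict:
    """Combine the parameters from the json file with the epsilon from the report."""
    kwargs = dict(json_params)  # copy so the caller's mapping is not modified
    kwargs["cluster_selection_epsilon"] = 0.15  # value stated in the report
    return kwargs


def fit_and_get_exemplars(df, json_params: Mapping[str, Any]):
    """Fit HDBSCAN on df and return (labels, exemplars).

    Assumption: df is the dataframe read from the attached data and
    json_params holds the remaining constructor parameters.
    """
    # Third-party import kept local so the helper above has no dependencies.
    import hdbscan

    clusterer = hdbscan.HDBSCAN(**build_hdbscan_kwargs(json_params))
    clusterer.fit(df)
    # exemplars_ is a list holding one array of exemplar points per cluster.
    return clusterer.labels_, clusterer.exemplars_
```

Calling fit_and_get_exemplars(df, params) on both machines and comparing len(exemplars[4]) should show the 10 versus 5 difference if the setup matches the one described.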
The platform name is printed with platform.platform().

Linux-6.5.11-linuxkit-x86_64-with-glibc2.36: the exemplars for cluster 4 have 10 items (this is running on Apple M2).

Linux-5.10.223-212.873.amzn2.x86_64-x86_64-with-glibc2.36: the exemplars for cluster 4 have only 5 items (this is running on one of the AWS machines, but it seems to happen on all EC2 instances we have).

Both returned the same clusters; only the exemplars are different. Also, on numpy 1 they returned the same exemplars.
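To make the comparison between machines easier to repeat, something like the following could be run on each platform and the outputs diffed. It is only a sketch: environment_info and summarize_exemplars are hypothetical helpers, the input is assumed to be a list with one 2D array of points per cluster (the shape of exemplars_ in the hdbscan package), and the list index is assumed to line up with the cluster label.

```python
import hashlib
import platform
from importlib import metadata


def environment_info() -> dict:
    """Record the platform string and the installed numpy version."""
    try:
        numpy_version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        numpy_version = None  # numpy is not installed in this environment
    return {"platform": platform.platform(), "numpy": numpy_version}


def summarize_exemplars(exemplars, decimals: int = 6) -> list:
    """Return one (cluster index, exemplar count, digest) tuple per cluster.

    Rows are rounded and sorted before hashing, so the digest does not
    depend on the order in which the exemplars are listed.
    """
    summary = []
    for cluster_id, points in enumerate(exemplars):
        rows = sorted(
            # Adding 0.0 turns -0.0 into 0.0 so both hash identically.
            tuple(round(float(v), decimals) + 0.0 for v in row)
            for row in points
        )
        digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()[:12]
        summary.append((cluster_id, len(rows), digest))
    return summary
```

Printing environment_info() next to summarize_exemplars(clusterer.exemplars_) on each machine gives a compact record of which clusters differ in exemplar count or content.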