rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Why does scikit HDBSCAN return different results when we compare with CuML's HDBSCAN? #4723

Open jcfaracco opened 2 years ago

jcfaracco commented 2 years ago

What is your question?

Hello all,

I'm trying to validate the two HDBSCAN implementations against each other and I'm getting a weird result. To explain it better, here is a simple script that demonstrates the differences between them. I really don't know whether I'm making a mistake, whether it's a bug or a missing feature, or whether it's working as designed.

import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs as make_blobs_cpu

from hdbscan import HDBSCAN as HDBSCAN_CPU
from cuml.cluster import HDBSCAN as HDBSCAN_GPU

np.random.seed(11)

sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

blobs_file = 'blobs.pickle'

if not os.path.exists(blobs_file):
    blobs, _ = make_blobs_cpu(n_samples=4000, centers=[(-0.75,2.25), (1.0, 2.0), (1.0, 1.0), (2.0, -0.5), (-1.0, -1.0), (0.0, 0.0)], cluster_std=0.5)
    test_data = np.vstack([blobs])

    with open(blobs_file, 'wb') as handle:
        pickle.dump(test_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(blobs_file, 'rb') as handle:
        test_data = pickle.load(handle)

plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)

clusterer = HDBSCAN_CPU(min_samples=1, min_cluster_size=100)

clusterer.fit(test_data)

palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                  if col < len(palette) else (0.5, 0.5, 0.5) for col, sat in
                  zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)

clusterer_gpu = HDBSCAN_GPU(min_samples=1, min_cluster_size=100)

clusterer_gpu.fit(test_data)

palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                  if col < len(palette) else (0.5, 0.5, 0.5) for col, sat in
                  zip(clusterer_gpu.labels_, clusterer_gpu.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)

I would love to share the plots I'm getting, but I cannot attach images here.

I read the paragraph in the API docs that mentions some variance between the two versions, but it describes small differences, not significant ones like I'm seeing:

Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.

I also read the HDBSCAN feature request here, which explains some points of the implementation: https://github.com/rapidsai/cuml/issues/1783

If you have any recommendations or guidelines to avoid this variation, I would be glad to hear them. I think we should be able to validate the two versions against each other, even if cuML's HDBSCAN has fewer features than the scikit version.
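One way to make this validation less fragile is to compare the two clusterings with a permutation-invariant metric such as the adjusted Rand index, rather than comparing the raw `labels_` arrays (whose integer values are arbitrary). A minimal sketch — the `labels_cpu`/`labels_gpu` arrays below are toy stand-ins for the `labels_` attributes of the two fitted clusterers, not real output:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Toy stand-ins for clusterer.labels_ and clusterer_gpu.labels_:
# the same grouping of points under a different label permutation.
labels_cpu = np.array([0, 0, 1, 1, 2, 2, -1])
labels_gpu = np.array([1, 1, 0, 0, 2, 2, -1])

# ARI is exactly 1.0 when the two partitions match up to relabeling,
# so it separates "different label numbers" from "different clusters".
ari = adjusted_rand_score(labels_cpu, labels_gpu)
ami = adjusted_mutual_info_score(labels_cpu, labels_gpu)
print(ari)  # 1.0 for these permuted-but-identical partitions
```

An ARI close to 1.0 with visually different colors would mean the implementations mostly agree and only the label numbering differs; a low ARI means the partitions themselves genuinely diverge.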

divyegala commented 2 years ago

@jcfaracco the first intuition I have is that your min_samples is really low. Can you try increasing it? If your data is really dense, it is possible that the first neighbor (because min_samples=1) may be found differently in the kNN step just through floating-point error.
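The ambiguity described above is easy to reproduce in isolation. When two neighbors of a point are exactly equidistant, which one a kNN backend returns first is implementation-defined, so with min_samples=1 the mutual-reachability distances (and hence the MST) can legitimately differ between libraries. A minimal sketch using scikit-learn's kNN:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# The origin has two neighbors at exactly the same distance (1.0).
# Which of the two a kNN backend lists first is arbitrary, which is
# why min_samples=1 can make CPU and GPU implementations diverge.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
nn = NearestNeighbors(n_neighbors=3).fit(pts)
dist, idx = nn.kneighbors(pts[:1])  # neighbors of the origin, incl. itself
print(dist[0])  # [0.0, 1.0, 1.0] -- the two non-self neighbors tie
```

With a larger min_samples the core distance is taken over more neighbors, so a single tie-break matters less.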

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

garyhsu29 commented 1 year ago

I have the same question here. In my case, min_samples=2 and min_cluster_size=2. The cuML HDBSCAN yields a very different result from the CPU version of HDBSCAN.

beckernick commented 1 year ago

@garyhsu29 , does this still happen if you increase the value of your hyperparameters? Are you using the same data as in the example above?

jcfaracco commented 2 months ago

@beckernick today I did an experiment with the HDBSCAN in scikit-learn and got the same results. I see inconsistencies with at least 3 datasets (I used the same one as in the example). I clearly see how the hyperparameters matter, but the point is that the same hyperparameters produce different results when we fit scikit's HDBSCAN and RAPIDS' HDBSCAN. For me, it is fine to have some inconsistencies between the CPU and GPU versions depending on how the algorithm was implemented, but I wonder why, technically.

cjnolet commented 2 months ago

@jcfaracco can you share some more information about the differences you are seeing? Are you seeing completely different clusterings or are there specific points that are showing up in some clusters? Are points being grouped together similarly but with different cluster labels assigned to them?

There are several ways in which differing implementations can yield results that are both correct yet still different. First, the minimum spanning trees themselves can be approximate, and I would not expect an approximate algorithm to yield exactly the same results in two different implementations.

You should be able to drag and drop images into the comment window on GitHub. It would be great if you could share some images, or at least a rough description of the differences you are seeing.

jcfaracco commented 2 months ago

@cjnolet here is a visual overview of the two versions (including the original dataset and the diff):

[attached image: original dataset, CPU and GPU clusterings, and their diff]

The Diff plot shows some clusters in yellow, orange, and light blue where the CPU and GPU versions differ; in regular blue, the two classifications agree.
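A diff mask like the one plotted above can be computed by first aligning the two label sets, since each implementation numbers its clusters arbitrarily. A sketch using Hungarian matching on the contingency matrix — the `a`/`b` arrays are hypothetical stand-ins for the CPU and GPU `labels_` arrays:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

def align_labels(labels_a, labels_b):
    """Relabel labels_b to best match labels_a via Hungarian matching."""
    cm = contingency_matrix(labels_a, labels_b)
    row, col = linear_sum_assignment(-cm)  # negate to maximize overlap
    mapping = {cb: ca for ca, cb in zip(np.unique(labels_a)[row],
                                        np.unique(labels_b)[col])}
    # Labels with no counterpart fall back to noise (-1).
    return np.array([mapping.get(l, -1) for l in labels_b])

# Toy stand-ins: the same partition except for the last point.
a = np.array([0, 0, 1, 1, 1])
b = np.array([1, 1, 0, 0, 1])
diff_mask = a != align_labels(a, b)  # points the two runs disagree on
print(diff_mask)  # only the last point differs
```

Plotting `diff_mask` instead of raw labels isolates genuine disagreements from mere renumbering, which should make the yellow/orange/light-blue regions easier to interpret.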