jcfaracco opened this issue 2 years ago
@jcfaracco the first intuition I have is that your min_samples is really low. Can you try increasing it? If your data is really dense, it is possible that the first neighbor (because min_samples=1) is found differently in the kNN step purely through floating-point error.
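As a toy illustration (this is not cuML's kNN code, just a sketch of the floating-point effect): when two candidate neighbors are mathematically equidistant, the order in which a distance reduction sums its terms can change the last bit of the result, which is enough to flip the ranking of tied neighbors between two implementations.

```python
# Summing the same squared-difference terms in two different orders
# (as a CPU loop vs. a GPU reduction might) gives two different floats.
terms = [0.1, 0.2, 0.3]

left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = (terms[2] + terms[1]) + terms[0]

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

With min_samples=1, whichever tied neighbor happens to evaluate as the smaller distance becomes "the" nearest neighbor, so the two libraries can diverge from the very first step.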
I have the same question here. In my case, min_samples=2 and min_cluster_size=2, and cuML's HDBSCAN yields a very different result from the CPU version of HDBSCAN.
@garyhsu29, does this still happen if you increase the values of your hyperparameters? Are you using the same data as in the example above?
@beckernick today I ran an experiment with HDBSCAN in scikit-learn and got the same results as before: I see inconsistencies on at least 3 datasets (I used the same ones as in the example). I clearly see how the hyperparameters matter, but the point is that the same hyperparameters produce different results when fitting scikit's HDBSCAN and RAPIDS' HDBSCAN. For me, it is fine to have some inconsistencies between the CPU and GPU versions depending on how the algorithm was implemented, but I wonder what the technical reason is.
@jcfaracco can you share some more information about the differences you are seeing? Are you seeing completely different clusterings, or are there specific points that show up in different clusters? Are points being grouped together similarly but with different cluster labels assigned to them?
There are several implementation details that can cause two different implementations to yield results which are both correct yet still different. First, the minimum spanning trees themselves can be approximate, and I would not expect an approximate algorithm to yield exactly the same results in two different implementations.
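Even an exact MST is not unique when edge weights tie, which happens often in dense data. A small self-contained sketch (not cuML's MST code) using Kruskal's algorithm on a triangle whose three edges all have the same weight: every pair of edges is a valid minimum spanning tree, so the tree you get depends purely on the order in which tied edges are visited.

```python
def kruskal(n, edges):
    """Kruskal's MST with a simple union-find; edges are (weight, u, v)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Stable sort by weight only, so tied edges keep their input order.
    for w, u, v in sorted(edges, key=lambda e: e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# A triangle with three equal-weight edges: any two of them form an MST.
edges = [(1.0, 0, 1), (1.0, 1, 2), (1.0, 0, 2)]
print(kruskal(3, edges))        # [(0, 1), (1, 2)]
print(kruskal(3, edges[::-1]))  # [(0, 2), (1, 2)]
```

Both answers have the same total weight, but HDBSCAN's condensed hierarchy is built from the tree's structure, so two equally valid MSTs can lead to different cluster boundaries downstream.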
You should be able to drag and drop images into the comment window on GitHub. It would be great if you could share some images, or at least a rough description of the differences you are seeing.
@cjnolet here is a visual overview of the two versions (including the original dataset and the diff):
The diff plot shows, in yellow, orange, and light blue, the points where the CPU and GPU versions disagree; in regular blue, the points where the cluster assignment is the same.
What is your question?
Hello all,
I'm trying to validate both HDBSCAN implementations and I'm getting a weird result. To explain it better, I'll show a simple code example that demonstrates the differences between them. I really don't know if I'm making a mistake, if it is a bug or a missing feature, or if it is working as designed.
I would love if I could share the plots I'm getting, but I cannot attach images here.
I read the paragraph in the API documentation that mentions some variance between the two versions, but it describes small differences, not significant ones like those I'm seeing:
Note that while the algorithm is generally deterministic and should provide matching results between RAPIDS and the Scikit-learn Contrib versions, the construction of the k-nearest neighbors graph and minimum spanning tree can introduce differences between the two algorithms, especially when several nearest neighbors around a point might have the same distance. While the differences in the minimum spanning trees alone might be subtle, they can (and often will) lead to some points being assigned different cluster labels between the two implementations.
I also read the HDBSCAN feature request here, which explains some points of the implementation: https://github.com/rapidsai/cuml/issues/1783
If you have any recommendation or guideline to avoid this variation, I would be glad to hear it. I think we should be able to validate both versions even if cuML's HDBSCAN has fewer features than the scikit version.
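One validation approach that sidesteps both arbitrary label IDs and the tie-breaking differences described above is to compare the two clusterings pairwise instead of label-by-label (this is a hedged sketch, not an official cuML utility; sklearn.metrics.adjusted_rand_score does the same job with chance correction):

```python
from itertools import combinations

def pairwise_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree about
    'same cluster' vs. 'different cluster'. Label values are ignored,
    so a pure relabeling scores a perfect 1.0."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Same partition, different label IDs -> perfect agreement.
cpu_labels = [0, 0, 1, 1, 2]  # hypothetical output of the CPU version
gpu_labels = [5, 5, 9, 9, 0]  # hypothetical output of the GPU version
print(pairwise_agreement(cpu_labels, gpu_labels))  # 1.0
```

Note that HDBSCAN marks noise points with the label -1; a pairwise score treats all noise points as one "cluster", so you may want to compare the noise masks separately as well.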