Open ZenBel opened 3 years ago
In order to do an apples to apples comparison you need to check that both algorithms are returning similar results. Dbscan can be blazingly fast if you are taking very small values for your epsilon parameter. It can also be incredibly slow if you select an epsilon parameter that is too large. I feel that both comparisons are misleading because neither end of the parameter spectrum provides useful results.
The way we did the comparison in this paper ( https://arxiv.org/abs/1705.07321) was to perform a parameter sweep across dbscan to find the parameter value which most closely matched the results from our hdbscan (measured via adjusted rand index between the cluster results). Once we had found the right parameter for generating similar results we ran both algorithms with those values and compared the speeds. This does bypass the difficulty of parameter selection in dbscan but it does, I think, provide the fairest comparison of run time between the two algorithms.
On Thu, Oct 21, 2021 at 4:02 PM ZenBel @.***> wrote:
Hello,
I am comparing HDBSCAN and DBSCAN clustering speeds for a dataset X of dimension 40000x228. I have tested both algorithms with the default settings:
from hdbscan import HDBSCAN form sklearn.cluster import DBSCAN
hdbscan = HDBSCAN().fit(X) dbscan = DBSCAN().fit(X)
Both algorithms don't take up much RAM (about 1GB) but it appears that DBSCAN is much faster than HDBSCAN. The former takes about 18 seconds, while the latter around 11 minutes. This seems to contradict what's written here https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html .
Is this behaviour to be expected or am I missing something?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/scikit-learn-contrib/hdbscan/issues/504, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3IUWRJLYKO4MUEJPYQZZDUIBWWPANCNFSM5GO6PTJQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Right, it makes sense. Thanks a lot for the clarification and the link to the article!
Z
Hello,
I am comparing HDBSCAN and DBSCAN clustering speeds for a dataset
X
of dimension 40000x228. I have tested both algorithms with the default settings:Both algorithms don't take up much RAM (about 1GB) but it appears that DBSCAN is much faster than HDBSCAN. The former takes about 18 seconds, while the latter around 11 minutes. This seems to contradict what's written here.
Is this behaviour to be expected or am I missing something?