scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.8k stars 502 forks source link

HDBSCAN vs DBSCAN speed #504

Open ZenBel opened 3 years ago

ZenBel commented 3 years ago

Hello,

I am comparing HDBSCAN and DBSCAN clustering speeds for a dataset X of dimension 40000x228. I have tested both algorithms with the default settings:

from hdbscan import HDBSCAN
from sklearn.cluster import DBSCAN

hdbscan = HDBSCAN().fit(X)
dbscan = DBSCAN().fit(X)

Both algorithms don't take up much RAM (about 1GB) but it appears that DBSCAN is much faster than HDBSCAN. The former takes about 18 seconds, while the latter around 11 minutes. This seems to contradict what's written here.

Is this behaviour to be expected or am I missing something?

jc-healy commented 3 years ago

In order to do an apples to apples comparison you need to check that both algorithms are returning similar results. Dbscan can be blazingly fast if you are taking very small values for your epsilon parameter. It can also be incredibly slow if you select an epsilon parameter that is too large. I feel that both comparisons are misleading because neither end of the parameter spectrum provides useful results.

The way we did the comparison in this paper ( https://arxiv.org/abs/1705.07321) was to perform a parameter sweep across dbscan to find the parameter value which most closely matched the results from our hdbscan (measured via adjusted rand index between the cluster results). Once we had found the right parameter for generating similar results we ran both algorithms with those values and compared the speeds. This does bypass the difficulty of parameter selection in dbscan but it does, I think, provide the fairest comparison of run time between the two algorithms.

On Thu, Oct 21, 2021 at 4:02 PM ZenBel @.***> wrote:

Hello,

I am comparing HDBSCAN and DBSCAN clustering speeds for a dataset X of dimension 40000x228. I have tested both algorithms with the default settings:

from hdbscan import HDBSCAN form sklearn.cluster import DBSCAN

hdbscan = HDBSCAN().fit(X) dbscan = DBSCAN().fit(X)

Both algorithms don't take up much RAM (about 1GB) but it appears that DBSCAN is much faster than HDBSCAN. The former takes about 18 seconds, while the latter around 11 minutes. This seems to contradict what's written here https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html .

Is this behaviour to be expected or am I missing something?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/scikit-learn-contrib/hdbscan/issues/504, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3IUWRJLYKO4MUEJPYQZZDUIBWWPANCNFSM5GO6PTJQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

ZenBel commented 3 years ago

Right, it makes sense. Thanks a lot for the clarification and the link to the article!

Z