scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Debugging performance issues (possible need to update benchmarks to consider varying dimensionality) #100

Open jlorince opened 7 years ago

jlorince commented 7 years ago

Not sure if I'm encountering a bug, but I've been running into some performance bottlenecks that don't jibe with the benchmarks you present in the docs.

I'm trying to run hdbscan on a ~500k x 100 numpy array, and after ~17 hours it still hasn't finished. This is on a 60-core Windows machine, Python 2.7, using:

hd = hdbscan.HDBSCAN(min_cluster_size=100, core_dist_n_jobs=60, prediction_data=True)
%time hd = hd.fit(data)

What I want to determine is whether this is some oddity of my data, my machine, or the dimensionality of my data. Looking at the benchmarking code in the docs, it seems as if all the tests are run on 10-dimensional vectors, but I don't know the algorithm well enough to have a sense of how it should scale with the dimensionality of the feature vectors.

I've only done some preliminary exploration, but from what I can see, performance scales much worse than linearly in the number of features. The following:

import hdbscan

hd = hdbscan.HDBSCAN(min_cluster_size=100, core_dist_n_jobs=60)
for k in (10, 50, 100):             # number of features
    for n in (1000, 10000, 50000):  # number of observations
        data = normed[:n, :k]
        print(n, k)
        %time hd = hd.fit(data)

yields:

| n \ k | 10 | 50 | 100 |
| --- | --- | --- | --- |
| 1,000 | 77 ms | 142 ms | 200 ms |
| 10,000 | 3.93 s | 9.69 s | 16.2 s |
| 50,000 | 56.7 s | 2 min 11 s | 10 min 30 s |

Not exhaustive testing, I know, but the fact that doubling the number of features (from 50 to 100) for a fixed number of observations (50k) leads to roughly 5x longer computation time is concerning.

Like I said, this may not be revealing any sort of bug/problem, but at the least it suggests that the benchmarking in the docs might be a bit misleading, and there are clearly non-linear effects with respect to not just the number of observations, but also the number of features.

Any thoughts on how to work around this or ideas on what might be going on?

lmcinnes commented 7 years ago

Performance degrades quite badly with dimensionality (in practice, potentially exponentially badly). This is where dimension reduction starts to become important (and is the focus of my current research). Taking 17 hours does sound rather bad, but unfortunately the parallelism applies to only a limited part of the algorithm, so while you will see some benefit from the 60 cores, potentially a majority of the computation is single-threaded. Parallelising the algorithm is another research problem I am working on.
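
For illustration only (an editorial sketch, not a recommendation made in this thread): one way to apply the dimension-reduction workaround is to project the data down, e.g. with scikit-learn's PCA, before handing it to HDBSCAN. The 10-component target and the random stand-in array below are assumptions, not values from the thread.

import hdbscan
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real ~500k x 100 array; purely illustrative.
data = np.random.rand(50000, 100)

# Reduce 100 features to 10 before clustering; the target dimension is an assumption.
reduced = PCA(n_components=10).fit_transform(data)

clusterer = hdbscan.HDBSCAN(min_cluster_size=100, core_dist_n_jobs=60)
clusterer.fit(reduced)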


jlorince commented 7 years ago

Ok, well that confirms I'm not crazy, haha. But a note to this effect would be really useful to include on the benchmarking page of the docs (https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html). As it stands it can be a bit misleading: leaving dimensionality out of the performance discussion implies that it doesn't matter for performance, but what you're saying here is that it might be a much more important determinant of runtime than the number of observations (in the limit).

lmcinnes commented 7 years ago

That's fair, although originally it was more of a performance comparison and, for example, DBSCAN will scale equally badly with dimension. I'll see if I can get the docs updated to point out the issue.


DataWaveAnalytics commented 7 years ago

Hello,

Sorry for commenting here, but I also hit a wall when trying to cluster a ~350K x 2 numpy array. Basically, I cannot replicate the performance experiments described here (I went for a coffee and came back to find the machine swapping like crazy, LOL). My machine has 8 cores and 32 GB of RAM.

Is there any performance improvement since this post?

Thank you

lmcinnes commented 7 years ago

Worst case, there has been a performance regression -- but I do believe we have run experiments quite recently with 512,000 points in 2 dimensions that completed in under a minute. My best guess is that your code is taking a bad code path for some reason. The other alternative is that, since we rely on space trees for performance, and they are susceptible to bad data distributions, your data may just be problematic. Is it possible to share the data?

DataWaveAnalytics commented 7 years ago

Thank you @lmcinnes for your reply. I have attached the data to this comment. Please let me know if you can get better performance. The format in the file is (space separated):

... [data.txt.zip](https://github.com/scikit-learn-contrib/hdbscan/files/1027554/data.txt.zip)

lmcinnes commented 7 years ago

Thanks for sharing the data -- it is very helpful. So far I only have equivocal news to report. On my Linux desktop I saw something similar to what you did, with huge amounts of swapping (although it is a very basic machine and has very little RAM). On my MacBook, however, everything went smoothly and the whole clustering run completed in about 30 seconds. I'll have to dig a little deeper to see what the issue is.

lmcinnes commented 7 years ago

It seems to be related to min_samples being large (which, if not set explicitly, defaults to min_cluster_size), and specifically to the core-distance query, which is handled by scikit-learn's KDTree and presumably uses a lot of memory building a large neighbour heap for each query point. I'll have to dig into this further later, but for now my recommendation is to try with min_samples set to 200 or less, and you should see reasonable performance. Explicitly, you want something like

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=1000, min_samples=100)  # keep min_samples explicit and small
clusterer.fit(data)

where you can vary min_cluster_size, but you should make sure you keep min_samples specified and set to something small. Hopefully this is enough to get you over your immediate hurdles. I'll try to work out whether we can improve performance for large min_samples values, but that may potentially be harder.

DataWaveAnalytics commented 7 years ago

I took the data from the SNAP repository. In particular, the dataset corresponds to this network. I applied the LINE and then LargeVis approaches to get the reductions.

It is intriguing that you could not reproduce the behavior on your MacBook, but anyway, you're right. After following your recommendation to set min_samples to a smaller value, I was able to get clustering results in 14 seconds for this data on my Linux machine. I hope you can describe this behavior somewhere in your documentation. Thank you for your help.

lmcinnes commented 7 years ago

I will add it to the FAQ in the documentation. It comes down to the memory requirements for large min_samples values on large datasets. As soon as you use up enough memory to hit swap, things go pretty badly.
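
As a rough, hedged back-of-envelope (an editorial estimate, not a figure from this thread): if the core-distance query keeps min_samples distances and indices per point, the query output alone grows linearly with min_samples, which is consistent with large values pushing a machine toward swap once the rest of the pipeline's memory use is added on top.

# Hedged estimate of k-NN query output size for the core-distance step,
# assuming min_samples float64 distances plus int64 indices per point.
# The actual internals of scikit-learn's KDTree query may differ.
def knn_output_gb(n_points, min_samples, bytes_per_entry=16):
    return n_points * min_samples * bytes_per_entry / 1e9

print(knn_output_gb(350000, 1000))  # ~5.6 GB if min_samples defaults to min_cluster_size=1000
print(knn_output_gb(350000, 100))   # ~0.56 GB with the recommended explicit min_samples=100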