scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Question about distance calculations #242

Open MilesCranmer opened 5 years ago

MilesCranmer commented 5 years ago

Hi, I was just wondering which part of the code calculates distances for vanilla HDBSCAN, and whether it could easily be swapped out for a custom metric.

My use-case is that I have around 77 million data points and want to use a custom distance metric. However, precomputing the full distance matrix is obviously impossible memory-wise, so I just want to plug the distance metric into HDBSCAN directly.

From testing a custom callable as the metric for sklearn.metrics.pairwise_distances, it seems much too slow to add it this way, since the callable is invoked separately for every single pair of points. My custom metric can be run on a whole vector of displacements at once, so I think it would be faster to insert it at the source.
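
For concreteness, roughly what I tried looks like the sketch below (my_metric here is just a toy stand-in for the real neural-net metric, and the data is random):

```python
import numpy as np
from sklearn.metrics import pairwise_distances
import hdbscan

# Toy stand-in for the real metric (the real one involves a neural net).
def my_metric(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

X = np.random.rand(2000, 10)

# pairwise_distances calls the Python-level callable once per pair of rows,
# so this is O(n^2) Python function calls and becomes very slow very quickly.
D = pairwise_distances(X, metric=my_metric)

# HDBSCAN also accepts a callable metric, but then it falls back to the same
# generic pairwise computation rather than the fast tree-based code path.
labels = hdbscan.HDBSCAN(metric=my_metric, min_cluster_size=10).fit_predict(X)
```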

MilesCranmer commented 5 years ago

So I looked more and I think I found it. The distance metric looks to be implemented in Cython, and is called around https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan_.py#L187-L193 via a callable. So is it right that, to write a custom distance metric, I just replace dist_metric here, and [here](https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan.py#L216-L225)? And I suppose the KDTree part as well...

Thanks! And awesome package by the way, it works astoundingly well. Best, Miles

lmcinnes commented 5 years ago

Sadly, I think you need to hack quite a few places to make this feasible for the fast version of the algorithm. The metric needs to be supported in Cython, and in the KDTree implementation (which you could hack into a custom build of sklearn). That's feasible, but a non-trivial amount of work. I am eventually planning to write a similar clustering algorithm that would use Numba and support custom distance metrics, but I have no timeline on that.
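
If it helps gauge the scope of that hack, the fast path is limited to the metrics the scikit-learn tree structures know about, which you can list directly (a small sketch; the valid_metrics attributes below are scikit-learn's, not hdbscan's):

```python
from sklearn.neighbors import KDTree, BallTree

# Metrics the Cython tree implementations support out of the box; a custom
# metric would need to be wired in at this level to use the fast algorithm.
print(KDTree.valid_metrics)
print(BallTree.valid_metrics)
```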

MilesCranmer commented 5 years ago

Thanks for the info! If I could ask one more question to point me in the right direction: could you guess roughly which part of the algorithm that uses the metric is the most time-consuming? The KDTree part, or the local (mutual reachability) parts? I'm working on a many-core machine, so if I could parallelize just the part where the metric is called, maybe I can speed it up enough.

Thanks!

lmcinnes commented 5 years ago

The most compute-intensive part of the algorithm is the dual tree Boruvka computation of the minimum spanning tree, and that makes a lot of calls to the metric. It is also the "trick" that allows the algorithm to complete with far fewer calls to the metric than would otherwise be required (sub-quadratic complexity). It ultimately relies on either a KDTree or a BallTree to do the heavy lifting, so it won't be easy to replace.

An alternative approach might be as follows: use an algorithm that supports k-nearest-neighbor queries (or KNN graph construction) efficiently and, ideally, in parallel. I have a package called PyNNDescent that allows custom metrics via Numba and might do the job, but you might also want to investigate FAISS, KGraph and NMSLib as alternatives. Once you have the KNN graph, you can create a sparse matrix of distances (just the distances to the k nearest neighbors of each point, for some reasonable value of k that works for you). HDBSCAN can work with precomputed sparse distance matrices, where entries that are missing from the sparse representation (i.e. not supplied) are treated as infinite distances. This may well be enough to do the clustering. You might have to be careful and run connected components on the sparse matrix first, and then run HDBSCAN on each component, but other than that I think this approach should work.
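
A rough sketch of that pipeline is below, with sklearn's kneighbors_graph standing in for whatever KNN method actually works with your custom metric (PyNNDescent, FAISS, etc.), and random data in place of the real points:

```python
import numpy as np
import hdbscan
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(10000, 8)   # toy data standing in for the real points
k = 15                         # number of nearest neighbors to keep

# Sparse matrix of distances to each point's k nearest neighbors. Any KNN
# library that returns neighbor indices and distances could be used instead.
knn = kneighbors_graph(X, n_neighbors=k, mode='distance')

# Symmetrize so the i->j and j->i entries agree.
knn = knn.maximum(knn.T).tocsr()

# Entries not present in the sparse matrix are treated as infinite distances,
# so check whether the KNN graph is one connected piece and, if not, consider
# clustering each component separately.
n_components, component_labels = connected_components(knn, directed=False)
print("connected components:", n_components)

# Cluster on the precomputed sparse distance matrix.
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=15)
cluster_labels = clusterer.fit_predict(knn)
```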

MilesCranmer commented 5 years ago

Awesome, thanks for the tips! Since I am dealing with 70 million points, is it possible to supply a proper precomputed sparse matrix object via, e.g., scipy.sparse?

I just ran HDBSCAN with a custom metric on 9000 points and it took more than an hour, whereas the normal Euclidean metric takes ~0.5 seconds, so I definitely need to do something smarter. Admittedly my metric is a bit complicated (it involves a neural net with some postprocessing), but it is fast when evaluated on whole batches of displacement vectors at once.

lmcinnes commented 5 years ago

Yes, it detects whether you are using a scipy.sparse matrix and operates accordingly. Just to be clear, 70 million points is well beyond anything I would expect to be able to run in anything resembling a reasonable amount of time.

MilesCranmer commented 5 years ago

Thanks, that's great! What do you think is a reasonable number of nearest neighbors to compute?

So I wrapped sklearn's NearestNeighbors in a Dask joblib parallel_backend, and with some hacks it scales well enough that I can compute this custom distance to the 6 nearest points in a reasonable time (roughly as sketched below). Basically my distance metric is a neural net in PyTorch, so I can't really port it to anything else yet. (There could also be a problem with the neural net being so nonlinear that the triangle inequality is violated and the KDTree assumptions broken... but it seems fine so far.)

I do plan on splitting the 70 million points into boxes, dividing along some parameter that hopefully won't cut through clusters. I'm recording HDBSCAN's scaling performance and will keep splitting until the expected runtime is under a week.
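
For reference, the wrapping looks roughly like this (a sketch only: net_distance is a placeholder for the actual PyTorch metric, and the real run needs the hacks mentioned above):

```python
import numpy as np
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.neighbors import NearestNeighbors

def net_distance(x, y):
    # Placeholder for the real PyTorch-based metric.
    return float(np.linalg.norm(x - y))

X = np.random.rand(50000, 16)  # toy data

client = Client()  # local Dask cluster; point this at the many-core machine instead

nn = NearestNeighbors(n_neighbors=6, metric=net_distance, n_jobs=-1)
nn.fit(X)

# Route joblib's n_jobs parallelism through the Dask workers for the query.
# Note: each point's nearest neighbor is itself, so drop that column when
# assembling the sparse precomputed matrix for HDBSCAN, as in the sketch above.
with parallel_backend('dask'):
    distances, indices = nn.kneighbors(X)
```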

lmcinnes commented 5 years ago

The short answer is that more neighbors are better, so whatever seems practicable to compute. If 6 seems tractable and more would be too hard, then go with that. If you can get up to say 10 or 15 that would definitely be better, but it's really a matter of what compute you can get away with.