scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

hdbscan with custom metric #146

Open sipan17 opened 6 years ago

sipan17 commented 6 years ago

Help needed! I want to run the algorithm with correlation distance and would appreciate some assistance. With the default metrics I ran the algorithm on 1.5M+ 128-dimensional vectors using the 'boruvka_kdtree' algorithm, but with a custom metric even 70k vectors give a memory error on 32GB of RAM. Can you help me add a custom metric that will not give a memory error, or give some instructions on how to use the 'pyfunc' metric?

lmcinnes commented 6 years ago

Sadly there are some serious catches here. To make use of the efficient approaches to HDBSCAN, the algorithm needs to make heavy use of the triangle inequality. Unfortunately correlation distance does not respect the triangle inequality, which rules out almost all of the ways of doing this efficiently. The result is that it has to fall back on what is effectively a brute-force method which, as you note, simply does not scale (it is O(N^2) in both compute and memory). There simply aren't any easy fixes for this -- while it is possible to create something that is at least linear in memory, it would still be O(N^2) in computational complexity, and there is no way it will scale to 1.5M data points.

If you need to cluster this data then I could point you to UMAP, which is a dimension reduction algorithm that does support custom metrics, including correlation distance. The goal would be to use UMAP to reduce your data to, say, 10 dimensions in Euclidean space -- you can then cluster with hdbscan using a Euclidean metric quite efficiently.

A minor catch on this: there were some bugs in the correlation distance computation, so be sure to clone from master on the repository rather than pip installing if you want to use correlation distance.

This is certainly not the same as actually clustering your data with correlation distance, but it is the best I can offer at this time.

sipan17 commented 6 years ago

Thanks a lot for your advice! Can you please tell me how to use 'pyfunc' as a metric for the algorithm? I didn't find any documentation on this.

lmcinnes commented 6 years ago

I believe that, in principle, as with KD-Trees, you can pass any suitable Python function as the argument to metric and it should work. I would warn that:

  1. This will be incredibly slow: every distance call will round-trip from Cython up to Python and back, which is extremely expensive computationally.
  2. I won't guarantee that this will actually work end to end with hdbscan and produce a clustering. As you note, it isn't documented, and it isn't really officially supported.