Open sipan17 opened 6 years ago
Sadly there are some serious catches here. To make use of the efficient approaches to HDBSCAN the algorithm needs to make heavy use of the triangle inequality. Unfortunately correlation distance does not respect the triangle inequality which rules out almost all the ways of doing this efficiently. The result is that it has to fall back on effectively a brute force method which, as you note, simply does not scale (it it O(N^2) both in compute and memory). There simply aren't any easy fixes for this -- while it is possible to create something that could at least be linear in memory it will still be O(N^2) in computational complexity and there is no way it will scale to 1.5m data points.
If you need to cluster this data then I could point you to UMAP which is a dimension reduction algorithm that does support custom metrics including correlation distance. The goal would be to use UMAP to reduce your data to, say, 10 dimensions in Euclidean space -- you can then cluster with hdbscan using a euclidean metric quite efficiently.
A minor catch on this: there were some bugs in the correlation distance computation, so be sure to clone from master on the repository rather than pip installing if you want to use correlation distance.
This is certainly not the same as actually clustering your data with correlation distance, but it is the best I can offer at this time.
Thanks a lot for your advice! Can you please tell how to use pyfunc as a metric for the algorithm. I didn't find any documentation on this.
I believe that, in principle, as with KD-Trees, you can pass any suitable python function as the argument to metric and it should work with that. I would warn that:
Help needed! I want to run the algorithm based on correlation distance. I would like to get assistance with doing so. With default metrics I ran the algorithm on 1.5m+ 128 dimensional vectors with 'boruvka_kdtree' algorithm, but with custom metric even with 70k vectors it gives memory error on 32GB RAM. Can you assist me with adding custom metric, which will not give memory error, or give some instruction on how to use metric 'pyfunc'?