scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

HDBSCAN scalability with precomputed matrices #254

Open jeffreyrnorton opened 6 years ago

jeffreyrnorton commented 6 years ago

We are interested in using HDBSCAN for larger problems, but scalability appears to be an issue.

For example, at https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan_.py, line 80, a copy of the input matrix is made instead of using the original. Why? Is the matrix being modified?

And at https://github.com/scikit-learn-contrib/hdbscan/blob/master/hdbscan/hdbscan_.py, lines 174-175, the type is converted to float64, which I assume addresses numerical convergence issues - but is it necessary?

We have considered two approaches for addressing the lack of scalability:

1) Use our own matrix-like objects that rely on tools such as memmap to address matrices too big to fit in memory. Besides access (rewriting `__getitem__`), what other issues do we need to consider for the precomputed matrix? (A sketch of what we have in mind follows below.)

2) More generally, shouldn't it be possible to pass in a distance-measure callable and compute the measure on the fly? A major problem with pairwise_distances is that it only accepts real numbers, which is far too restrictive. I don't know whether it could work, but the idea is that every time a distance is required, you call the measurement function on the two points (vectors) in question. Is that even conceivable with the current implementation?
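To make approach 1 concrete, here is a minimal sketch of the kind of memmap-backed wrapper we have in mind (the class name and file layout are illustrative, not part of hdbscan):

```python
import numpy as np

class MemmapDistanceMatrix:
    """Illustrative disk-backed pairwise distance matrix.

    Exposes numpy-style indexing over an n x n matrix stored on
    disk, so only the blocks actually touched are paged into RAM.
    """

    def __init__(self, path, n_points, dtype=np.float64):
        # mode="r" opens an existing file read-only; the matrix is
        # never fully loaded into memory.
        self._data = np.memmap(path, dtype=dtype, mode="r",
                               shape=(n_points, n_points))
        self.shape = (n_points, n_points)
        self.dtype = self._data.dtype

    def __getitem__(self, index):
        # Delegate to the memmap; numpy handles slices and fancy
        # indexing, paging in only the blocks it needs.
        return self._data[index]
```

The open question is everything beyond indexing: any code path that calls np.asarray() on the input, or hands it to compiled code, would presumably still pull the whole matrix into memory.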

lmcinnes commented 6 years ago

The scalability for pre-computed matrices is really not great -- that wasn't a use case I designed for. In practice you can probably get away just fine with float32 with minimal loss, but there will be flow-on effects in the Cython code that are potentially problematic. The same holds for a memmap matrix that attempts to address the memory issues: you would need to ensure that the memory layout, as seen from the C code, looks the same as that of a numpy array.
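To be concrete, the constraint is roughly this (file name and size are illustrative):

```python
import numpy as np

n = 100_000  # illustrative size

# np.memmap already behaves like an ndarray; the critical points are
# dtype and layout, because the Cython layer reads raw memory and
# expects what it was compiled against (float64, row-major).
D = np.memmap("distances.dat", dtype=np.float64, mode="r",
              shape=(n, n))
assert D.dtype == np.float64      # float32 would require matching Cython typedefs
assert D.flags["C_CONTIGUOUS"]    # row-major layout, as the C code expects
```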

In practice, if I were optimizing a version of hdbscan specifically for low memory use on large pre-computed distance matrices, I would go about things somewhat differently. That case is much easier to handle than the more complex code that processes vector space data highly efficiently (particularly from a runtime perspective). Potentially one could simply rewrite those core portions and then use the remaining hdbscan code to handle the rest.

Ultimately, for a pre-computed distance matrix you want to replace this function with something more scalable (along with the associated preprocessing): https://github.com/scikit-learn-contrib/hdbscan/blob/9840b7e1d4a75765c20c9e45686c3b657cf415de/hdbscan/hdbscan_.py#L68. You can likely re-use the label function (which is really just a union-find on the minimum spanning tree).

So, ultimately, you just need to convert your pre-computed distance matrix to a mutual reachability matrix and then run single linkage on it. The results can be passed to the other hdbscan routines to get the rest of what you need efficiently. In practice you can do that with a little matrix manipulation and a call to fastcluster. Is this the sort of thing you had in mind?
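A rough sketch of that conversion and the fastcluster call (the mutual reachability construction follows the standard definition; treat the output format as something to verify against hdbscan's tree routines rather than a guaranteed drop-in):

```python
import numpy as np
import fastcluster
from scipy.spatial.distance import squareform

def mutual_reachability(distance_matrix, min_points=5):
    # Core distance of each point: distance to its min_points-th
    # nearest neighbour (index 0 of a sorted row is the zero
    # self-distance, so the entry at index min_points is what we want).
    core = np.partition(distance_matrix, min_points, axis=1)[:, min_points]
    # Mutual reachability distance: elementwise max of the two core
    # distances and the original pairwise distance.
    return np.maximum(np.maximum(core[:, None], core[None, :]),
                      distance_matrix)

# Illustrative usage on a small synthetic distance matrix:
rng = np.random.default_rng(0)
X = rng.random((200, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

mr = mutual_reachability(D, min_points=5)
# fastcluster.linkage takes a condensed distance vector, like scipy;
# checks=False skips the zero-diagonal check (core distances occupy
# the diagonal of the mutual reachability matrix).
tree = fastcluster.linkage(squareform(mr, checks=False), method="single")
```

The result is an (n - 1) x 4 linkage array in scipy's format, which is close to the single linkage tree hdbscan builds internally; from there the condensed tree and labelling steps can take over.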