scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Return exemplars when using precomputed distance matrix? #169

Open helske opened 6 years ago

helske commented 6 years ago

I would like to use some precomputed distance matrices in HDBSCAN, and later predict the cluster memberships of new datapoints. I understand that this is currently not possible, as HDBSCAN cannot know the actual values of the old points, so comparing the new points with them is impossible. But I thought that if I could still get the indices of the exemplar points, then I could make crude predictions based on the distances from those exemplar points, which I think would be roughly half of the solution used in fuzzy clustering. Would this make any sense, or are there other ways to "hack" the algorithm in order to get predictions with a precomputed distance matrix?

helske commented 6 years ago

Oh well, I just realised that HDBSCAN seems to rely on the triangle inequality, so using a distance matrix computed with dynamic time warping might not be a good idea anyway...

lmcinnes commented 6 years ago

Technically you can get away with breaking the triangle inequality with a precomputed distance matrix -- the triangle inequality is only necessary if you want to compute a clustering without doing the full N^2 work of computing all pairwise distances.
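If you do want to know whether a DTW-derived matrix behaves like a metric, here is a small sketch (plain NumPy, not part of hdbscan) that checks the triangle inequality over all triples of a precomputed distance matrix:

```python
import numpy as np

def violates_triangle_inequality(D, tol=1e-12):
    """Return True if any triple (i, j, k) has D[i, k] > D[i, j] + D[j, k].

    D is assumed to be a symmetric (n, n) precomputed distance matrix.
    O(n^3), so this is only for sanity-checking modestly sized matrices.
    """
    n = D.shape[0]
    for j in range(n):
        # For a fixed intermediate point j, check all (i, k) pairs at once:
        # D[i, k] must not exceed D[i, j] + D[j, k].
        if np.any(D > D[:, j][:, None] + D[j, :][None, :] + tol):
            return True
    return False

# A Euclidean distance matrix always satisfies the inequality...
points = np.array([[0.0], [1.0], [2.5]])
D_euclid = np.abs(points - points.T)
print(violates_triangle_inequality(D_euclid))  # False

# ...while this hand-made matrix does not: d(0,2)=3 > d(0,1)+d(1,2)=2.
D_bad = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])
print(violates_triangle_inequality(D_bad))  # True
```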

lmcinnes commented 6 years ago

It is possible to hack around this so that you can make predictions based on distance matrices, as long as you compute the distances from the new point to all existing points. Nothing in the code is set up to work that way, but I believe if you look through the prediction code you can see where to plug it all in. Ask me for more details on any parts that are unclear.

helske commented 6 years ago

Thanks. The issue with this is that I would rather not compute the distances between the new points and all the existing points, as this soon becomes quite a heavy operation with large samples. That's why I thought about computing the distances only between the new points and the exemplar points, in a similar fashion to centroid-based clustering algorithms.
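The crude exemplar-based assignment described above could be sketched roughly like this (plain NumPy; `dist_to_exemplars` and `exemplar_labels` are hypothetical inputs you would have to produce yourself, since nothing in hdbscan currently works this way):

```python
import numpy as np

def assign_by_nearest_exemplar(dist_to_exemplars, exemplar_labels):
    """Crude prediction: give each new point the cluster label of its
    nearest exemplar, in the style of centroid-based assignment.

    dist_to_exemplars : (n_new, n_exemplars) distances from the new
        points to the exemplar points only (e.g. computed with DTW).
    exemplar_labels   : (n_exemplars,) cluster label of each exemplar.
    """
    nearest = np.argmin(dist_to_exemplars, axis=1)
    return exemplar_labels[nearest]

# Toy example: two clusters with two exemplars each.
exemplar_labels = np.array([0, 0, 1, 1])
dists = np.array([[0.2, 0.3, 5.0, 6.0],   # clearly nearest cluster 0
                  [4.0, 5.0, 0.1, 0.4]])  # clearly nearest cluster 1
print(assign_by_nearest_exemplar(dists, exemplar_labels))  # [0 1]
```

Note that this ignores density entirely, which is exactly the gap the next comment points out.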

lmcinnes commented 6 years ago

You can do that, but it raises issues with mutual reachability distance -- you need some notion of the distance to the kth nearest neighbor of the new point, and that depends on all the points.
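To spell out the dependence: the mutual reachability distance between points a and b is max(core_k(a), core_k(b), d(a, b)), where core_k(x) is the distance from x to its kth nearest neighbor -- and computing core_k for a new point needs its distances to the existing points, not just to the exemplars. A minimal sketch from a full precomputed matrix:

```python
import numpy as np

def mutual_reachability(D, k):
    """Mutual reachability matrix from a precomputed distance matrix D.

    core[i] is the distance from point i to its kth nearest neighbor
    (excluding itself); mr[i, j] = max(core[i], core[j], D[i, j]).
    """
    # Sorting each row puts the self-distance (0) in column 0, so
    # column k holds the distance to the kth nearest neighbor.
    core = np.sort(D, axis=1)[:, k]
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)

# Three points on a line at 0, 1, and 10, with k = 1 (nearest neighbor).
D = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0,  9.0],
              [10.0, 9.0, 0.0]])
mr = mutual_reachability(D, k=1)
print(mr[0, 1])  # 1.0: both core distances are 1, and d = 1
print(mr[0, 2])  # 10.0: d dominates the core distances (1 and 9)
```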