scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.8k stars 501 forks source link

HDBSCAN for graph data #533

Open JanRhoKa opened 2 years ago

JanRhoKa commented 2 years ago

Hello, is it possible to run HDBSCAN on a graph without having to convert the graph into a sparse matrix first ?

Currently we convert the graph into a sparse matrix and use the "precomputed" metric of HDBSCAN to create outlier scores. This unfortunately takes quite a while to compute as we have to run a shortest paths algorithm for very vertex to create a complete distance matrix. It would be a lot fast if we could just start at the minimum spanning tree step of the algorithm, as we already have a graph object (iGraph), which could easily be transformed into a minimum spanning tree.

lmcinnes commented 2 years ago

You could reach into the code a little bit and write some routines that will do this. If you looke at the HDBSCAN generic code you can see the heart of the generic case, and realise it isn't really that much code. You can write your own function and replace the min spanning tree computation with your own for a graph, and likely get what you want. The main concern in how best to replace the core distance notion for the graph, but that's probably a calculation you can make. From there you just need to note that the core hdbscan function is mostly just a lot of case handling to take different algorithmic paths based on the in put data; and then a call the _tree_to_labels with the requisite inputs. So wiring that together with your new function to handle graphs directly would not be hard.

If you are feeling ambitious you could even handle the type detection and call your own routine if the input X is an igraph object, and provide a PR to add graph clustering.