scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Plotting the tree for clustering using cluster_selection_epsilon #571

Open lucetka opened 2 years ago

lucetka commented 2 years ago

The built-in condensed_tree_.plot() seems to be useful only for very simple clustering results, because huge and complex trees are virtually illegible. Furthermore, when cluster_selection_epsilon is used, the condensed tree seems to be the tree BEFORE the branches affected by the selected epsilon are merged (and the select_cluster method also doesn’t return the final flat clusters after applying the epsilon).

I still haven’t found an optimal way to plot large, monstrous trees so that they are really easy to read, but I’m currently experimenting with plotly icicle plots and treemaps to show at least the relationships between the clusters in the final result (I put the lambda values in the hover info so they are not totally lost). I’ve tried to adjust the tree to reflect the application of cluster_selection_epsilon and I’d like to check that what I’m doing makes sense. Here is what I do:

1 – First, get the flat labeling for a clustering without the epsilon and for one with the epsilon (otherwise using the same parameters).

2 – Match the flat labels of the clustering without epsilon to those of the clustering with epsilon, based on datapoint membership (one cluster in the with-epsilon clustering will “engulf” multiple clusters from the clustering without epsilon). The final eps clusters will obviously be larger than a simple sum of the merged non-eps clusters, because they also contain the points that were previously noise separating the non-eps clusters, so this is just an intermediate step, shown in the middle of my scheme below.

3 – Match the tree cluster IDs to the flat labels and sort the tree dataframe (I use the pandas version) by lambda.
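The membership-based matching in step 2 might look roughly like the sketch below. It assumes two flat label sequences of equal length in which -1 marks noise; the names `labels_noeps`, `labels_eps` and `match_labels` are hypothetical and the data is made up. Each non-eps cluster is assigned to the eps cluster that holds most of its points.

```python
from collections import Counter, defaultdict


def match_labels(labels_noeps, labels_eps, noise=-1):
    """Map each non-noise label of the clustering without epsilon to the
    label of the with-epsilon clustering that contains most of its points."""
    votes = defaultdict(Counter)
    for old, new in zip(labels_noeps, labels_eps):
        if old != noise:
            votes[old][new] += 1
    # For every old label, keep the new label with the highest point count.
    return {old: counts.most_common(1)[0][0] for old, counts in votes.items()}


# Toy labels (hypothetical): clusters 0 and 1 are engulfed by eps cluster 0,
# which also absorbs the noise point that used to separate them.
labels_noeps = [0, 0, 1, 1, -1, 2, 2, -1]
labels_eps = [0, 0, 0, 0, 0, 1, 1, -1]

mapping = match_labels(labels_noeps, labels_eps)
print(mapping)
```

With real results, the two sequences would be the `labels_` attributes of the two fitted clusterers.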
Filter out clusters not affected by cluster_selection_epsilon (i.e. those where lambda <= 1/eps). For each remaining unique flat label of the clustering using cluster_selection_epsilon, take the first row (i.e. the subcluster with the lowest lambda) and find its parent in the tree. If the parent's lambda is below the 1/eps threshold, then this is the parent which will take on the identity of the flat label; if the lambda is higher than the threshold, look for a “grandparent”, and if needed a great-grandparent, etc. Then delete all children in the tree dataframe with lambda higher than the threshold and plot the tree.

[Image: noeps2eps_labelled]

Did I interpret the tree and how everything works correctly, or am I doing something stupid? Thanks!
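The ancestor search and pruning described above can be sketched in plain Python as follows. This is only an illustration of the described steps, not the library's implementation. The row keys mirror the columns of `condensed_tree_.to_pandas()` (`parent`, `child`, `lambda_val`, `child_size`), but the toy tree, the `selected` list and the helper `epsilon_ancestor` are hypothetical. Stopping at direct children of the root is an assumption of this sketch.

```python
def epsilon_ancestor(cluster, parent_of, birth_lambda, eps):
    """Climb from `cluster` until reaching a node whose birth lambda is at or
    below 1/eps, or a direct child of the root (assumed stopping rule)."""
    threshold = 1.0 / eps
    node = cluster
    # The root never appears as a child, so it has no entry in birth_lambda.
    while birth_lambda[node] > threshold and parent_of[node] in birth_lambda:
        node = parent_of[node]
    return node


# Cluster-to-cluster rows of a toy condensed tree (values are made up).
rows = [
    {"parent": 10, "child": 11, "lambda_val": 0.5, "child_size": 40},
    {"parent": 10, "child": 12, "lambda_val": 0.5, "child_size": 30},
    {"parent": 11, "child": 13, "lambda_val": 1.5, "child_size": 20},
    {"parent": 11, "child": 14, "lambda_val": 1.5, "child_size": 15},
    {"parent": 13, "child": 15, "lambda_val": 4.0, "child_size": 8},
    {"parent": 13, "child": 16, "lambda_val": 4.0, "child_size": 7},
]

eps = 0.5  # hypothetical cluster_selection_epsilon, so 1/eps = 2.0
parent_of = {r["child"]: r["parent"] for r in rows}
birth_lambda = {r["child"]: r["lambda_val"] for r in rows}

# Clusters selected without epsilon (here: the leaves of the toy tree).
selected = [15, 16, 14, 12]
merged = {c: epsilon_ancestor(c, parent_of, birth_lambda, eps) for c in selected}

# Drop every child cluster that was born above the 1/eps threshold.
pruned = [r for r in rows if r["lambda_val"] <= 1.0 / eps]

print(merged)
print([r["child"] for r in pruned])
```

With a real tree, rows like these could be obtained from the pandas version by keeping only the rows with `child_size > 1` and calling `to_dict("records")`.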