scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.78k stars 497 forks

Conversion of Condensed tree DataFrame into dendrogram #280

Open pstumps opened 5 years ago

pstumps commented 5 years ago

Hello, I am attempting to create a dendrogram from a previous run. I generated a condensed tree and converted it to a DataFrame using the to_pandas() method, then saved that data as a .csv file. I can no longer re-run my original clustering to regenerate the condensed tree. I tried to rebuild the condensed tree by converting the .csv data into a numpy array (which I believe is the format the CondensedTree object accepts) and passing it to hdbscan.plots.CondensedTree(data). This does not seem to work: I receive this error when calling plot():

File "/Users/pstumps/anaconda3/lib/python3.7/site-packages/hdbscan/plots.py", line 339, in plot
    max_rectangle_per_icicle=max_rectangles_per_icicle)
File "/Users/pstumps/anaconda3/lib/python3.7/site-packages/hdbscan/plots.py", line 113, in get_plot_data
    leaves = _get_leaves(self._raw_tree)
File "/Users/pstumps/anaconda3/lib/python3.7/site-packages/hdbscan/plots.py", line 43, in _get_leaves
    cluster_tree = condensed_tree[condensed_tree['child_size'] > 1]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Is there a way to "reverse generate" a dendrogram from this data? How does plot() accept the condensed tree object?
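For readers who hit the same traceback, here is a minimal sketch of why it happens. The failing line indexes the tree by a field name ('child_size'), which a plain 2-D numpy array does not support. The array values below are made up for illustration and are not from the original data.

```python
import numpy as np

# A plain 2-D float array, as produced by e.g. DataFrame.values on the CSV data.
# The values are invented; the column order is assumed to be
# parent, child, lambda_val, child_size.
plain = np.array([
    [5, 6, 0.5, 3],
    [5, 7, 0.5, 2],
    [6, 0, 1.0, 1],
])

error_message = None
try:
    # Same style of indexing as the failing line in the traceback:
    # a string key is only valid on a structured array with named fields.
    plain[plain["child_size"] > 1]
except IndexError as exc:
    error_message = str(exc)
    print(type(exc).__name__ + ":", error_message)
```

A plain array has no named fields, so the string key is rejected before any comparison happens.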

lmcinnes commented 5 years ago

It wants a numpy structured array, which is perhaps a trickier thing to deal with. It is certainly possible to reverse engineer the appropriate structured array, but it is not trivial. Fortunately, I believe the pandas column names are the relevant structured array field names, so that provides a start. You'll want to look at the NumPy structured array documentation or a tutorial on structured arrays and figure out how to construct the right thing.
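A minimal sketch of that reconstruction follows. It assumes the saved columns are named parent, child, lambda_val and child_size, and that integer fields use np.intp and lambda_val uses float64. These names and dtypes are assumptions to check against your own CSV and hdbscan version. The helper name dataframe_to_condensed_array and the tiny tree are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Assumed field layout of the condensed tree (verify against your hdbscan version).
CONDENSED_TREE_DTYPE = np.dtype([
    ("parent", np.intp),
    ("child", np.intp),
    ("lambda_val", np.float64),
    ("child_size", np.intp),
])


def dataframe_to_condensed_array(df):
    """Pack the four condensed-tree columns into a numpy structured array."""
    out = np.empty(len(df), dtype=CONDENSED_TREE_DTYPE)
    for name in CONDENSED_TREE_DTYPE.names:
        out[name] = df[name].to_numpy()
    return out


# Tiny invented tree standing in for the saved .csv file.
csv_text = """parent,child,lambda_val,child_size
5,6,0.5,3
5,7,0.5,2
6,0,1.0,1
6,1,1.2,1
6,2,1.5,1
7,3,0.8,1
7,4,0.9,1
"""
df = pd.read_csv(io.StringIO(csv_text))
raw_tree = dataframe_to_condensed_array(df)

# The field-name mask from the traceback is valid on a structured array.
cluster_tree = raw_tree[raw_tree["child_size"] > 1]
print(cluster_tree["child"])
```

If the assumptions hold, an array built this way is the kind of object that hdbscan.plots.CondensedTree(raw_tree) expects. Newer hdbscan releases may take additional constructor arguments, so check the signature in your installed version.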

pstumps commented 5 years ago

Thanks a lot for getting back to me, it's much appreciated. I read through the links you sent me and I believe I was able to get the data in a structured array by using pandas to_records() method. I assume this is also non-trivial, but is it possible to reverse engineer the linkage matrix from these data?
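For reference, a small sketch of the to_records() route. It assumes the CSV was written by to_csv() with its default index=True, which adds an unnamed leading index column. That column has to be kept out of the record array, otherwise it becomes an extra field. The data below is invented.

```python
import io

import pandas as pd

# Simulates a file written by DataFrame.to_csv() with the default index column.
csv_text = (
    ",parent,child,lambda_val,child_size\n"
    "0,5,6,0.5,3\n"
    "1,5,7,0.5,2\n"
    "2,6,0,1.0,1\n"
)

# index_col=0 reads the unnamed first column back in as the index ...
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... and index=False keeps that index out of the resulting record array.
records = df.to_records(index=False)
print(records.dtype.names)

# Field-name indexing works on the record array.
mask = records["child_size"] > 1
print(records[mask]["child"])
```

The field dtypes here are whatever pandas inferred (typically int64 and float64), which may or may not match what the plotting code expects on a given platform.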

lmcinnes commented 5 years ago

I believe at best you can only recover an approximation of the original linkage data from the condensed tree, because some information was lost along the way. The smaller your min_cluster_size, the less information was lost.