scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.81k stars, 505 forks

Parent and Child ID in Condensed Tree dataframe #312

Open NDBaxi opened 5 years ago

NDBaxi commented 5 years ago

Dear Authors,

I am using HDBSCAN for my research project. My data frame has dimensions 21263 x 81. I want to relate the parent and child IDs in the condensed tree data frame to my original data set, so that I can extract the clusters of similar records. I have searched forum answers and the HDBSCAN literature on this topic, but I am still unable to map the parent and child IDs back to my original data set. How can I achieve this?

I also noticed that the values in the labels_ array differ from the parent and child IDs. Can anyone please explain the connection between them, and how to map one to the other?

I would be grateful for an explanation of these two points.

Best regards,

lmcinnes commented 5 years ago

Quoting from the docstring:

        Each row of the dataframe corresponds to an edge in the tree.
        The columns of the dataframe are `parent`, `child`, `lambda_val`
        and `child_size`.
        The `parent` and `child` are the ids of the
        parent and child nodes in the tree. Node ids less than the number
        of points in the original dataset represent individual points, while
        ids greater than the number of points are clusters.

Thus, if the child id is less than the number of points (in your case, 21263), it is the index of a single data point in your dataset. Otherwise it is the id of a cluster. The labels_ are a flat clustering extracted from this tree and relabelled to be simpler. You can look at the _select_clusters method (https://github.com/scikit-learn-contrib/hdbscan/blob/d3da6a7a01528c7a8d9ee6f73fb112d40948109d/hdbscan/plots.py#L234) to see an example of how clusters are extracted from the tree and relabelled.
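
The id rule can be illustrated on a small hand-made tree. This is only a sketch, not library code: the rows below are invented, laid out with the four columns the docstring names, and the root cluster is assumed to have an id equal to the number of points. With a fitted clusterer, the equivalent rows should be available from `clusterer.condensed_tree_.to_pandas()`.

```python
# Toy condensed tree for a dataset of 6 points (ids 0..5).
# Each row is (parent, child, lambda_val, child_size), mirroring the
# columns named in the docstring. The values are invented.
N_POINTS = 6  # would be 21263 for the dataset in the question

rows = [
    (6, 5, 0.5, 1),  # point 5 falls out of the root cluster 6
    (6, 7, 1.0, 3),  # root splits into cluster 7 (3 points) ...
    (6, 8, 1.0, 2),  # ... and cluster 8 (2 points)
    (7, 0, 2.0, 1),
    (7, 1, 2.0, 1),
    (7, 2, 2.5, 1),
    (8, 3, 3.0, 1),
    (8, 4, 3.0, 1),
]

# A child id below N_POINTS is a row index into the original dataset;
# any other child id names a cluster node (assumption: root id == N_POINTS).
point_edges = [(p, c) for p, c, _lam, _size in rows if c < N_POINTS]
cluster_edges = [(p, c) for p, c, _lam, _size in rows if c >= N_POINTS]

# Group the data-point indices by the cluster node they hang off directly.
direct_points = {}
for parent, child in point_edges:
    direct_points.setdefault(parent, []).append(child)

for parent, members in sorted(direct_points.items()):
    print(f"cluster node {parent}: direct data-point indices {members}")
```

The indices collected in `direct_points` can be used to select rows of the original data, for example with `df.iloc[members]` on a pandas data frame. The ids used here are tree node ids, not the values found in `labels_`.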

NDBaxi commented 5 years ago

Hi Leland,

Thanks for your prompt response. First, I would like to compliment you, the other authors, and the many contributors who have helped to create HDBSCAN. It is a very good tool for navigating variable-density data sets, and it makes machine learning on non-linearly clustered data really valuable. Thank you for developing a great tool.

Back to my earlier questions: the explanation of the parent ID, child ID and labels is clear. Following one of your responses to an earlier post, I am now also able to extract the original data points that correspond to the labels array. I had also used the _select_clusters method before reaching out to you.

Just one last thing: how do I get the list of original data points that belong to a parent ID greater than 21263, and the points that are counted in the child_size column of the condensed tree?
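
One possible way to approach this last question, offered as a sketch rather than as the library's own method: walk down from the cluster node, follow every child that is itself a cluster, and collect every child that is a data point. The helper name `points_under` and the toy rows are hypothetical, and the root cluster is assumed to have an id equal to the number of points. Real rows should be obtainable from `clusterer.condensed_tree_.to_pandas()`.

```python
from collections import defaultdict

# Toy condensed tree for 6 points: (parent, child, lambda_val, child_size).
# The values are invented for illustration.
N_POINTS = 6
rows = [
    (6, 5, 0.5, 1),
    (6, 7, 1.0, 3),
    (6, 8, 1.0, 2),
    (7, 0, 2.0, 1),
    (7, 1, 2.0, 1),
    (7, 2, 2.5, 1),
    (8, 3, 3.0, 1),
    (8, 4, 3.0, 1),
]


def points_under(rows, node_id, n_points):
    """Return the sorted data-point indices below a cluster node.

    Hypothetical helper: descends through child clusters (ids >= n_points)
    and collects leaf children (ids < n_points).
    """
    children = defaultdict(list)
    for parent, child, _lam, _size in rows:
        children[parent].append(child)

    members = []
    stack = [node_id]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child < n_points:
                members.append(child)  # index into the original dataset
            else:
                stack.append(child)    # nested cluster: keep descending
    return sorted(members)


# In this toy tree, child_size of a cluster row equals the number of
# points found beneath that child cluster.
for parent, child, _lam, child_size in rows:
    if child >= N_POINTS:
        members = points_under(rows, child, N_POINTS)
        print(f"cluster {child}: points {members}, child_size {child_size}")
```

The returned indices refer to rows of the original dataset, so they can be passed to something like `df.iloc[...]` to pull out the records for a given cluster node.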