scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Return exemplar indices in data instead of actual points. #304

Open hammadmazhar1 opened 5 years ago

hammadmazhar1 commented 5 years ago

Hi, I'm using HDBSCAN to cluster network traffic data. I see that there is a way to retrieve the exemplar points representing the clusters. Would it be possible to instead return the indices of these points in the provided data, so that they are easier to correlate with the raw data? (I apply some transformations to create numerical representations of hostnames.) If there is already a method to do so, please let me know.

lmcinnes commented 5 years ago

It is certainly possible -- I don't have the exact approach to hand right now, but if you look at the code in the prediction.py file you'll see how the exemplar extraction is done, and it should be relatively straightforward to adapt that.

jsgroob commented 5 years ago

@hammadmazhar1

Not sure if you figured this out or not, but here's a code snippet that works for me. In the following, clusterer is a fitted hdbscan clusterer object. The output is a list with one array per selected cluster, each holding the indexes (starting at 0) of that cluster's exemplars in the input data.


```python
import numpy as np

# Clusters selected by HDBSCAN, and the raw condensed tree (a NumPy structured array).
selected_clusters = clusterer.condensed_tree_._select_clusters()
raw_condensed_tree = clusterer.condensed_tree_._raw_tree

exemplars = []
for cluster in selected_clusters:
    cluster_exemplars = np.array([], dtype=np.int64)
    # Visit every leaf cluster beneath the selected cluster.
    for leaf in clusterer._prediction_data._recurse_leaf_dfs(cluster):
        # Largest lambda value among the points in this leaf.
        leaf_max_lambda = raw_condensed_tree['lambda_val'][
            raw_condensed_tree['parent'] == leaf].max()
        # Points in the leaf that reach that lambda value are its exemplars.
        points = raw_condensed_tree['child'][
            (raw_condensed_tree['parent'] == leaf) &
            (raw_condensed_tree['lambda_val'] == leaf_max_lambda)]
        cluster_exemplars = np.hstack([cluster_exemplars, points])
    exemplars.append(cluster_exemplars)
```

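
In case it helps to see what the inner loop is doing, here is a toy sketch of the per-leaf selection in plain Python, with no hdbscan or NumPy needed. The `Row` tuple, `TOY_TREE`, and `leaf_exemplars` are made up for illustration; only the field names mirror the condensed tree columns used above. The assumption is that points 0 to 5 are data rows and ids 6 to 8 are cluster nodes.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of the condensed tree.
Row = namedtuple("Row", ["parent", "child", "lambda_val", "child_size"])

# Toy tree: root cluster 6 splits into leaf clusters 7 and 8.
TOY_TREE = [
    Row(6, 7, 0.5, 3),
    Row(6, 8, 0.5, 3),
    Row(7, 0, 1.0, 1),
    Row(7, 1, 2.0, 1),
    Row(7, 2, 2.0, 1),
    Row(8, 3, 0.8, 1),
    Row(8, 4, 1.5, 1),
    Row(8, 5, 1.2, 1),
]


def leaf_exemplars(tree, leaf):
    """Return the children of `leaf` that share its largest lambda value."""
    members = [row for row in tree if row.parent == leaf]
    max_lambda = max(row.lambda_val for row in members)
    return [row.child for row in members if row.lambda_val == max_lambda]


print(leaf_exemplars(TOY_TREE, 7))  # [1, 2]
print(leaf_exemplars(TOY_TREE, 8))  # [4]
```

Because the children of a leaf are row numbers of the input data, the values collected this way are already indices rather than coordinates.
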
Humbertzhang commented 2 years ago

For anyone who runs into this error: AttributeError: 'NoneType' object has no attribute '_recurse_leaf_dfs', try adding the code below at the beginning of @jsgroob's code. It works for me :).

```python
if clusterer._prediction_data is None:
    clusterer.generate_prediction_data()
```
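
Coming back to the original question (correlating exemplars with raw hostnames), a small helper along these lines might be enough once you have the index lists. This is only a sketch: `exemplar_records`, the sample `hostnames`, and the sample `exemplars` are hypothetical, and it assumes the raw records are in the same row order as the array that was passed to the clusterer, with no rows dropped in between.

```python
def exemplar_records(exemplars, raw_records):
    """Map per-cluster exemplar indices back to the raw records they came from."""
    # int(i) also accepts NumPy integer scalars, such as those in the arrays built above.
    return [[raw_records[int(i)] for i in cluster] for cluster in exemplars]


# Hypothetical raw data, one hostname per row of the numeric input matrix.
hostnames = [
    "a.example.com",
    "b.example.com",
    "c.example.com",
    "d.example.org",
    "e.example.org",
    "f.example.org",
]

# Hypothetical output of the exemplar snippet: one index list per cluster.
exemplars = [[1, 2], [4]]

print(exemplar_records(exemplars, hostnames))
# [['b.example.com', 'c.example.com'], ['e.example.org']]
```
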