Open hammadmazhar1 opened 5 years ago
It is certainly possible -- I don't have the exact approach to hand right now, but if you look at the code in the prediction.py file you'll see how the exemplar extraction is done, and it should be relatively straightforward to adapt that.
@hammadmazhar1
Not sure if you figured this out or not, but here's a code snippet that works for me. In the following, `clusterer` is a fitted hdbscan clusterer object. The output is a list with one array per selected cluster, each holding the indices (starting at 0) of that cluster's exemplars.
```python
import numpy as np

selected_clusters = clusterer.condensed_tree_._select_clusters()
raw_condensed_tree = clusterer.condensed_tree_._raw_tree

exemplars = []
for cluster in selected_clusters:
    cluster_exemplars = np.array([], dtype=np.int64)
    # Visit every leaf of the condensed tree beneath this cluster.
    for leaf in clusterer._prediction_data._recurse_leaf_dfs(cluster):
        leaf_max_lambda = raw_condensed_tree['lambda_val'][
            raw_condensed_tree['parent'] == leaf].max()
        # Keep the points that stay in the leaf up to its maximum lambda.
        points = raw_condensed_tree['child'][
            (raw_condensed_tree['parent'] == leaf) &
            (raw_condensed_tree['lambda_val'] == leaf_max_lambda)]
        cluster_exemplars = np.hstack([cluster_exemplars, points])
    exemplars.append(cluster_exemplars)
```
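To see what the inner step of that snippet does without fitting anything, here is a minimal sketch on a made-up tree. It assumes a NumPy structured array with the same `parent`, `child` and `lambda_val` fields the snippet reads; `toy_tree` and `leaf_exemplars` are hypothetical names, and the values are invented for illustration.

```python
import numpy as np

# Toy stand-in for the raw condensed tree (all values are made up).
# Leaf 5 holds points 0, 1, 2; leaf 6 holds points 3, 4.
toy_tree = np.array(
    [
        (5, 0, 1.0, 1),
        (5, 1, 2.0, 1),
        (5, 2, 2.0, 1),
        (6, 3, 0.5, 1),
        (6, 4, 1.5, 1),
    ],
    dtype=[("parent", np.intp), ("child", np.intp),
           ("lambda_val", np.float64), ("child_size", np.intp)],
)


def leaf_exemplars(raw_tree, leaf):
    """Return the children of `leaf` whose lambda_val equals the leaf's maximum."""
    under_leaf = raw_tree["parent"] == leaf
    max_lambda = raw_tree["lambda_val"][under_leaf].max()
    return raw_tree["child"][under_leaf & (raw_tree["lambda_val"] == max_lambda)]


print(leaf_exemplars(toy_tree, 5))  # points 1 and 2 share the top lambda
print(leaf_exemplars(toy_tree, 6))  # point 4
```

Since the `child` values of leaf rows are point ids, the result can be used directly as row indices into the data that was clustered.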
For anyone who hits this error: `AttributeError: 'NoneType' object has no attribute '_recurse_leaf_dfs'`, you can try adding the code below at the beginning of @jsgroob's code. It works for me :).
if clusterer._prediction_data is None:
    clusterer.generate_prediction_data()
Hi, I'm using HDBSCAN to cluster network traffic data. I see that there is a way to retrieve the exemplar points representing the clusters. Would it be possible to instead return the indices of these points in the provided data, so that they are easier to correlate with the raw data? (I apply some transformations to create numerical representations of hostnames.) If there is already a method to do so, please let me know.
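A possible fallback that avoids private attributes is to match the exemplar points back to rows of the input. This is only a sketch, and not something the library provides: `exemplar_row_indices` is a hypothetical helper, and it assumes each exemplar is an exact copy of a row in the data that was clustered (for example one array from `clusterer.exemplars_`). If the data contains duplicate rows, every duplicate is returned.

```python
import numpy as np


def exemplar_row_indices(data, exemplar_points):
    """Return sorted indices of rows in `data` equal to any row of `exemplar_points`."""
    data = np.asarray(data)
    exemplar_points = np.asarray(exemplar_points)
    # Compare every data row with every exemplar row via broadcasting;
    # the intermediate array has shape (n_data, n_exemplars, n_features).
    matches = (data[:, None, :] == exemplar_points[None, :, :]).all(axis=2)
    return np.flatnonzero(matches.any(axis=1))


# Toy data: rows 1 and 3 are identical, so both match the exemplar [1, 1].
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 1.0]])
print(exemplar_row_indices(data, np.array([[1.0, 1.0]])))
```

The broadcasted comparison uses memory proportional to rows × exemplars × features, so for large datasets it would need to be done in chunks.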