scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Extend to weighted DBSCAN and some implementation suggestions #101

Open lidingX opened 7 years ago

lidingX commented 7 years ago
Consider clustering a weighted data set D: every d in D has a weight w(d) ≥ 0, with ∑ w(d) = 1, and |D| = n.

Traditional clustering algorithms translate readily into the weighted setting by considering their behavior on data sets containing duplicate elements: in the weighted setting, it is as if w(d)·n points sit at the location of d. For weighted DBSCAN, only a slight modification to the k-NN finder (e.g. the k-d tree) is needed, allowing duplicates: while traversing the k-d tree, the algorithm still maintains the k current nearest points in a set, but each time it finds a point d_i nearer than some points in the set, it has actually found w(d_i)·n nearer points in the weighted setting, and it replaces all points in the set farther than the new ones. As a result, the k nearest neighbors of d may contain duplicates. The other steps of the algorithm remain unchanged.

I suggest the weights input should be w(d)·n, since for large n, w(d) can be very small and lead to numerical underflow, whereas w(d)·n stays roughly O(1).

For the fixed-radius case, if you already provide a function to count the points lying within the core distance of a point d, the algorithm can instead sum w(d_i)·n within that radius: if the sum ≥ k, d is a core point; otherwise it is not.

I also recommend adding DBSCAN to your implementation. In some situations we only need a single cluster level, given prior knowledge that the data has uniform density, and DBSCAN using your techniques could be faster than any other implementation. Moreover, you could then show the difference between DBSCAN* and HDBSCAN, and how much better HDBSCAN is in a nonuniform-density situation. Your implementation would be more complete!
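The weighted core-point logic described above can be sketched with brute-force distance computations (a toy illustration, not the k-d tree traversal the issue proposes; the function names are mine, and I assume Euclidean distance and the suggested w(d)·n scaling):

```python
import numpy as np

def weighted_core_distance(X, weights, k):
    """Core distance in the weighted setting: each point d_i counts as
    weights[i] * n duplicate points, so the core distance of d is the
    smallest radius whose enclosed scaled weight reaches k.
    Assumes k does not exceed the total scaled weight (which is n when
    the weights sum to 1)."""
    n = len(X)
    scaled = weights * n  # w(d) * n, as suggested, to avoid underflow
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    core = np.empty(n)
    for i in range(n):
        order = np.argsort(dists[i])
        cum = np.cumsum(scaled[order])
        # first radius at which the accumulated weight reaches k
        core[i] = dists[i][order[np.searchsorted(cum, k)]]
    return core

def is_core_point(X, weights, eps, k):
    """Fixed-radius variant: d is a core point iff the total scaled
    weight within eps of d is at least k."""
    n = len(X)
    scaled = weights * n
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (scaled[None, :] * (dists <= eps)).sum(axis=1) >= k
```

With uniform weights w(d) = 1/n, the scaled weights are all 1 and both functions reduce to the ordinary unweighted definitions, which is a quick sanity check on the formulation.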

dgthoms commented 7 years ago

I've been comparing HDBSCAN with DBSCAN for a specific use case on a dataset I've been working on, which got me thinking about your second suggestion.

1) You might not want this (given your nonuniform-density comment), but you would need to use the metric distance rather than the transformed mutual reachability distance; that should just amount to adding an 'if' statement in the source.

1b) You can somewhat force HDBSCAN to do this by setting a very low min_samples value; however, this might not match your desired k/min_pts for DBSCAN (the wiki recommends setting min_pts > 2 to avoid precisely this, although I'm not sure that's the 'right' justification).

2) This is probably more like what you had in mind: you can output the single linkage tree in scipy format (single_linkage_tree.to_scipy) and then use scipy.cluster.hierarchy's cut_tree to cut the dendrogram at your specified height. Similarly, if your use case lets you set a desired number of clusters/partitions, the same function handles that too. I suppose this could be added as a feature to HDBSCAN, but all the structures are already easily accessible...
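To illustrate the cut_tree approach: below, a scipy linkage matrix built from toy data stands in for the single linkage tree that HDBSCAN exposes (which uses the same (n-1, 4) scipy format); the data and parameters are mine, chosen only to make the cut behave predictably.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Two tight, well-separated blobs so the dendrogram has an obvious gap.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)),
               rng.normal(5.0, 0.1, (5, 2))])

# Single-linkage dendrogram in scipy's standard linkage-matrix format.
Z = linkage(X, method="single")

# Cut by height (the DBSCAN*-style distance threshold)...
labels_h = cut_tree(Z, height=1.0).ravel()

# ...or by a desired number of clusters/partitions.
labels_k = cut_tree(Z, n_clusters=2).ravel()
```

Note that cut_tree returns a column per cutoff value, hence the ravel() when only one height or cluster count is requested.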

I might be missing something, though. I'm not at work this week, so I haven't had a chance to test it out.