mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0
304 stars 64 forks source link

Add cluster_selection_epsilon #76

Closed cmalzer closed 1 month ago

cmalzer commented 1 month ago

This adds the cluster_selection_epsilon parameter as requested in Issue 56 .

Apparently, FOSC is implemented in this repository as a recursive top-down approach where the tree is pruned (keep_children=False) if stability prefers the parent node. So, I just added a simple epsilon check which sets keep_children to false if eps_birth is smaller than cluster_selection_epsilon. This should still return the most stable EOM clusters for parts of tree not affected by epsilon, which is important to achieve a hybrid version instead of just extracting DBSCAN* clusters (please double-check the changes to make sure I didn't miss anything!).

A simple test (note that this isn't the best demonstration dataset as you could just use dbscan instead of HDBSCAN(e) here, this is just for testing that the implementation works in general):

library(dbscan)
library(factoextra) # for multishapes

x <- multishapes[, 1:2]

set.seed(123)
mintPts <- 4

hdb <- hdbscan(x, minPts = mintPts, verbose=TRUE)
plot(hdb, show_flat = TRUE)
print(hdb)
plot(x, col=hdb$cluster+1, main="HDBSCAN")

hdbe <- hdbscan(x, minPts = mintPts, cluster_selection_epsilon=0.2, verbose=TRUE)
plot(hdbe, show_flat = TRUE)
print(hdbe)
plot(x, col=hdbe$cluster+1, main="HDBSCAN(e)")
mhahsler commented 1 month ago

@cmalzer Thank you for adapting the code. I tried the following:

library(dbscan)
data(DS3)
cl <- hdbscan(DS3, minPts = 20, cluster_selection_epsilon = 20)
plot(cl)
plot(DS3, col = cl$cluster+1)

Should this produce only a single large cluster by deselecting all smaller clusters?

Regards, Michael

cmalzer commented 1 month ago

Hi Michael,

this depends on whether the implementation allows the selection of the root node. Often this isn’t intended because returning the root node basically means that there are no actual clusters. The original HDBSCAN(e) also does not consider selection of the root node, but of course you could just do it anyway - and in your example, users might even expect this to happen (although it should be noted that cluster_selection_epsilon is usually a very small value that should never reach the height of the root).

In buildHDBSCAN.cpp, the line if (!keep_children && cid != "0") apparently is the reason why the root node (=0) is not returned, i.e. the current version does not support returning a single cluster.

In the Python version of HDBSCAN, there is a specific parameter allow_single_cluster which is set to False by default, but can also set to True if needed. Maybe this would also be an option for this repository?

Regards, Claudia

mhahsler commented 1 month ago

@cmalzer I think that is fine as long as we put it into the manpage. If I change the parameter in the following way:

library(dbscan)
data(DS3)
cl <- hdbscan(DS3, minPts = 20, cluster_selection_epsilon = 15)
plot(cl)
plot(DS3, col = cl$cluster+1)

There are three nodes below an eps of 15, but the clustering returns 5 clusters. Did I misunderstand something.

Would the flat clustering not be equivalent to

cl <- dbscan(DS3, minPts = 20, eps = 15, borderPoints = FALSE)
plot(DS3, col = cl$cluster+1)
cmalzer commented 1 month ago

Yes, you are right indeed! I just committed a bugfix for this. Problem was that the first nodes below epsilon were not de-selected, only the subtree below that. I replaced epsilon_birth by epsilon_death, which is more relevant for a top-down approach because it checks the level of the child nodes rather then the level of the current node and we can then set "keep_children=False" if that level lies below epsilon. Please let me know if there still are any issues. If you would like to, I can also add a short explanation of the parameter in the README.

mhahsler commented 1 month ago

Hi Claudia. Thanks for the fix.

Best, Michael

cmalzer commented 1 month ago

Thanks, I just committed the changes (apparently my IDE automatically re-formatted the code/ added some brackets in hdbscan.R but I guess this should not matter).

Best regards, Claudia