Closed cmalzer closed 1 month ago
@cmalzer Thank you for adapting the code. I tried the following:
library(dbscan)
data(DS3)
cl <- hdbscan(DS3, minPts = 20, cluster_selection_epsilon = 20)
plot(cl)
plot(DS3, col = cl$cluster+1)
Should this produce only a single large cluster by deselecting all smaller clusters?
Regards, Michael
Hi Michael,
this depends on whether the implementation allows the selection of the root node. Often this isn’t intended because returning the root node basically means that there are no actual clusters. The original HDBSCAN(e) also does not consider selection of the root node, but of course you could just do it anyway - and in your example, users might even expect this to happen (although it should be noted that cluster_selection_epsilon is usually a very small value that should never reach the height of the root).
In buildHDBSCAN.cpp
, the line if (!keep_children && cid != "0")
apparently is the reason why the root node (=0) is not returned, i.e. the current version does not support returning a single cluster.
In the Python version of HDBSCAN, there is a specific parameter allow_single_cluster
which is set to False by default, but can also set to True if needed. Maybe this would also be an option for this repository?
Regards, Claudia
@cmalzer I think that is fine as long as we put it into the manpage. If I change the parameter in the following way:
library(dbscan)
data(DS3)
cl <- hdbscan(DS3, minPts = 20, cluster_selection_epsilon = 15)
plot(cl)
plot(DS3, col = cl$cluster+1)
There are three nodes below an eps of 15, but the clustering returns 5 clusters. Did I misunderstand something.
Would the flat clustering not be equivalent to
cl <- dbscan(DS3, minPts = 20, eps = 15, borderPoints = FALSE)
plot(DS3, col = cl$cluster+1)
Yes, you are right indeed! I just committed a bugfix for this. Problem was that the first nodes below epsilon were not de-selected, only the subtree below that. I replaced epsilon_birth by epsilon_death, which is more relevant for a top-down approach because it checks the level of the child nodes rather then the level of the current node and we can then set "keep_children=False" if that level lies below epsilon. Please let me know if there still are any issues. If you would like to, I can also add a short explanation of the parameter in the README.
Hi Claudia. Thanks for the fix.
Best, Michael
Thanks, I just committed the changes (apparently my IDE automatically re-formatted the code/ added some brackets in hdbscan.R but I guess this should not matter).
Best regards, Claudia
This adds the cluster_selection_epsilon parameter as requested in Issue 56 .
Apparently, FOSC is implemented in this repository as a recursive top-down approach where the tree is pruned (keep_children=False) if stability prefers the parent node. So, I just added a simple epsilon check which sets keep_children to false if eps_birth is smaller than cluster_selection_epsilon. This should still return the most stable EOM clusters for parts of tree not affected by epsilon, which is important to achieve a hybrid version instead of just extracting DBSCAN* clusters (please double-check the changes to make sure I didn't miss anything!).
A simple test (note that this isn't the best demonstration dataset as you could just use dbscan instead of HDBSCAN(e) here, this is just for testing that the implementation works in general):