Closed: kbarylyuk closed this issue 6 years ago
Hi @kbarylyuk,
Regarding the min_samples parameter:
The approach taken by the package right now is a bit different from SciKit's. In short, you are right, there are two 'min*' parameters associated with HDBSCAN. The first is minPts, which determines how the HDBSCAN hierarchy is made, i.e.
x <- as.matrix(iris[, 1:4])
cl <- dbscan::hdbscan(x, minPts = 5L)
The second, what you refer to as min_samples, is set by default to whatever the initial setting of minPts was, i.e. min_samples = minPts; however, it can be given an alternative setting as follows:
dbscan::extractFOSC(cl$hc, minPts = <min_cluster_size>)
I recommend perusing the members of the cl object returned by HDBSCAN. It contains several members which may be useful, one of which is the hc element, an hclust object representing the HDBSCAN hierarchy. This hierarchy is what extractFOSC parses to determine the resulting clustering.
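To make the two steps concrete, here is a minimal sketch that builds the hierarchy once and then re-extracts flat clusters with a different setting. It assumes the dbscan package is installed; the value 10L is an arbitrary illustration, not a recommendation.

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])

# Build the HDBSCAN hierarchy; minPts here controls how the hierarchy is made.
cl <- hdbscan(x, minPts = 5L)

# List the members of the returned object (the hc element is among them).
print(names(cl))

# Re-extract flat clusters from the same hierarchy with a different minPts.
# 10L is an arbitrary value chosen only for illustration.
res <- extractFOSC(cl$hc, minPts = 10L)

# Compare the default assignment with the re-extracted one.
print(table(default = cl$cluster, reextracted = res$cluster))
```

Since the hierarchy in cl$hc is reused, trying several values this way does not require re-running hdbscan itself.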
Regarding the cluster selection option:
As of right now, hdbscan only supports optimizing the excess of mass functional. I'm not sure what other cluster selection methods are in the Python one. However, it's worth noting that the hc element that was used above is indeed a valid hclust object. hclust objects natively have conversions to dendrogram objects as well, so any tools you find that work off of either hclust or dendrogram objects in the R world can be used with HDBSCAN as well. For example, you can cut the tree like you might do with other hierarchical clustering algorithms (see ?stats::cutree), you can use the few cluster validation indices built for hierarchical clustering, and you can do a lot of dendrogram enhancements and statistics with packages like dendextend.
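As a rough sketch of the tree-cutting route, the snippet below treats the hc element like any other hclust object. It assumes the dbscan package is installed; k = 3 is an arbitrary choice for illustration. Note that cutree assigns every point to a group, so there is no noise label as in the default HDBSCAN result.

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])
cl <- hdbscan(x, minPts = 5L)

# Cut the HDBSCAN hierarchy into a fixed number of groups.
# k = 3 is an arbitrary value used only for illustration.
flat <- stats::cutree(cl$hc, k = 3)
print(table(flat))

# Convert to a dendrogram so dendrogram-based tools can be applied.
dend <- stats::as.dendrogram(cl$hc)
print(attr(dend, "members"))  # number of leaves in the tree
```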
Hi @peekxc,
thank you very much for these suggestions. I'll explore these options and see if they allow me to do what I want.
Hi,
I would like to have access to the min_samples and cluster_selection_method tunable parameters of the hdbscan function. In the SciKit-learn docs (https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) that the HDBSCAN vignette refers to, there is a chapter on parameter selection for HDBSCAN. While the current implementation of HDBSCAN in the dbscan package for R has only one tunable parameter, minPts, more parameters (including min_samples and cluster_selection_method) are described by the chapter. One scenario that the chapter describes in relation to cluster_selection_method is:
This is very similar to what I get with my data (the dimensionality is roughly 4000-by-40): I obtain several smaller clusters (which are better separated) and one "mega-cluster".
I am quite certain that the "mega-cluster" has some meaningful structure within it that I would like to have resolved. From what I read in the SciKit-learn docs chapter, it seems possible to achieve this by tuning those other parameters, particularly cluster_selection_method. Is there a way to set explicit values for the min_samples and cluster_selection_method parameters in the current hdbscan function from the dbscan package for R, or would it be possible to add this feature? Thank you.
My R session info: