mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0

HDBSCAN parameters #24

Closed kbarylyuk closed 6 years ago

kbarylyuk commented 6 years ago

Hi,

I would like to have access to the min_samples and cluster_selection_method tunable parameters of the hdbscan function.

The SciKit-learn docs (https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) that the HDBSCAN vignette refers to include a chapter on parameter selection for HDBSCAN. The current implementation of HDBSCAN in the dbscan package for R has only one tunable parameter, minPts, but the chapter describes more, including min_samples and cluster_selection_method. One scenario the chapter describes in relation to cluster_selection_method is:

If you are more interested in having small homogeneous clusters then you may find Excess of Mass has a tendency to pick one or two large clusters and then a number of small extra clusters. In this situation you may be tempted to recluster just the data in the single large cluster. Instead, a better option is to select 'leaf' as a cluster selection method.

This is very similar to what I get with my data (the dimensionality is roughly 4000-by-40): I obtain several smaller clusters (which are better separated) and one "mega-cluster".

[Image: rplot]

I am quite certain that the "mega-cluster" has some meaningful structure within it that I would like to have resolved. From what I read in the SciKit-learn docs chapter, it seems possible to achieve this by tuning those other parameters, particularly cluster_selection_method. Is there a way to set explicit values for the min_samples and cluster_selection_method parameters in the current hdbscan function from the dbscan package for R, or would it be possible to add this feature? Thank you.

My R session info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pRolocGUI_1.11.2     RColorBrewer_1.1-2   ggplot2_2.2.1        dbscan_1.1-1         pRoloc_1.19.1       
 [6] MLInterfaces_1.56.0  cluster_2.0.6        annotate_1.54.0      XML_3.98-1.10        AnnotationDbi_1.38.2
[11] IRanges_2.10.5       S4Vectors_0.14.7     MSnbase_2.2.0        ProtGenerics_1.8.0   BiocParallel_1.10.1 
[16] mzR_2.10.0           Rcpp_0.12.15         Biobase_2.36.2       BiocGenerics_0.22.1 

loaded via a namespace (and not attached):
  [1] plyr_1.8.4            igraph_1.1.2          lazyeval_0.2.1        splines_3.4.2        
  [5] ggvis_0.4.3           crosstalk_1.0.0       digest_0.6.15         foreach_1.4.4        
  [9] BiocInstaller_1.26.1  htmltools_0.3.6       viridis_0.5.0         gdata_2.18.0         
 [13] magrittr_1.5          memoise_1.1.0         doParallel_1.0.11     sfsmisc_1.1-1        
 [17] limma_3.32.10         recipes_0.1.2         gower_0.1.2           rda_1.0.2-2          
 [21] dimRed_0.1.0          lpSolve_5.6.13        colorspace_1.3-2      blob_1.1.0           
 [25] dplyr_0.7.4           RCurl_1.95-4.10       hexbin_1.27.2         genefilter_1.58.1    
 [29] bindr_0.1             impute_1.50.1         survival_2.41-3       iterators_1.0.9      
 [33] glue_1.2.0            DRR_0.0.3             gtable_0.2.0          ipred_0.9-6          
 [37] zlibbioc_1.22.0       kernlab_0.9-25        ddalpha_1.3.1.1       prabclus_2.2-6       
 [41] DEoptimR_1.0-8        scales_0.5.0          vsn_3.44.0            mvtnorm_1.0-7        
 [45] DBI_0.7               viridisLite_0.3.0     xtable_1.8-2          foreign_0.8-69       
 [49] bit_1.1-12            proxy_0.4-21          mclust_5.4            preprocessCore_1.38.1
 [53] DT_0.4                lava_1.6              prodlim_1.6.1         htmlwidgets_1.0      
 [57] sampling_2.8          threejs_0.3.1         FNN_1.1               fpc_2.1-11           
 [61] modeltools_0.2-21     pkgconfig_2.0.1       flexmix_2.3-14        nnet_7.3-12          
 [65] caret_6.0-78          labeling_0.3          tidyselect_0.2.3      rlang_0.2.0          
 [69] reshape2_1.4.3        munsell_0.4.3         mlbench_2.1-1         tools_3.4.2          
 [73] RSQLite_2.0           pls_2.6-0             broom_0.4.3           stringr_1.3.0        
 [77] yaml_2.1.16           mzID_1.14.0           ModelMetrics_1.1.0    knitr_1.20           
 [81] bit64_0.9-7           robustbase_0.92-8     randomForest_4.6-12   purrr_0.2.4          
 [85] dendextend_1.7.0      bindrcpp_0.2          nlme_3.1-131.1        whisker_0.3-2        
 [89] mime_0.5              RcppRoll_0.2.2        biomaRt_2.32.1        compiler_3.4.2       
 [93] e1071_1.6-8           affyio_1.46.0         tibble_1.4.2          stringi_1.1.6        
 [97] lattice_0.20-35       trimcluster_0.1-2     Matrix_1.2-12         psych_1.7.8          
[101] gbm_2.1.3             pillar_1.1.0          MALDIquant_1.17       bitops_1.0-6         
[105] httpuv_1.3.5          R6_2.2.2              pcaMethods_1.68.0     affy_1.54.0          
[109] hwriter_1.3.2         gridExtra_2.3         codetools_0.2-15      MASS_7.3-48          
[113] gtools_3.5.0          assertthat_0.2.0      CVST_0.2-1            withr_2.1.1          
[117] mnormt_1.5-5          diptest_0.75-7        grid_3.4.2            rpart_4.1-12         
[121] timeDate_3043.102     tidyr_0.8.0           class_7.3-14          Rtsne_0.13           
[125] shiny_1.0.5           lubridate_1.7.2       base64enc_0.1-3      

peekxc commented 6 years ago

Hi @kbarylyuk,

Regarding the min_samples parameter:

The approach taken by the package right now is a bit different from SciKit's. In short, you are right, there are two 'min*' parameters associated with HDBSCAN. The first is minPts, which determines how the HDBSCAN hierarchy is made, i.e.

x <- as.matrix(iris[, 1:4])
cl <- dbscan::hdbscan(x, minPts = 5L)

The second, what you refer to as min_samples, is set by default to whatever the initial setting of minPts was, i.e. min_samples = minPts. However, it can be given an alternative setting as follows:

dbscan::extractFOSC(cl$hc, minPts = <min_cluster_size>)

I recommend perusing the members of the cl object returned by HDBSCAN. It contains several members which may be useful. One of these is the hc element, an hclust object representing the HDBSCAN hierarchy. This hierarchy is what extractFOSC parses to determine the resulting clustering.
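
As a rough sketch of the two suggestions above, assuming the dbscan package is installed: the members of cl can be listed with names(), and the hierarchy can be re-extracted with a different minPts. The value 10L is an arbitrary illustration, not a recommendation.

```r
library(dbscan)

# Build the HDBSCAN hierarchy on the iris measurements (same setup as above)
x <- as.matrix(iris[, 1:4])
cl <- hdbscan(x, minPts = 5L)

# List the members of the returned object; hc is the hclust hierarchy
names(cl)
class(cl$hc)

# Re-extract clusters from the same hierarchy with a different minPts
# (10L is an illustrative value only)
res <- extractFOSC(cl$hc, minPts = 10L)

# Cluster sizes; label 0 denotes noise points
table(res$cluster)
```

Comparing table(cl$cluster) with table(res$cluster) shows how the extraction changes without rebuilding the hierarchy.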

Regarding the cluster selection option:

As of right now, hdbscan only supports optimizing the excess of mass functional. I'm not sure what other cluster selection methods are in the Python one. However, it's worth noting that the hc element used above is a valid hclust object. hclust objects also convert natively to dendrogram objects, so any tools in the R world that work on either hclust or dendrogram objects can be used with HDBSCAN as well. For example, you can cut the tree as you might with other hierarchical clustering algorithms (see ?stats::cutree), use the few cluster validation indices built for hierarchical clustering, and do a lot of dendrogram enhancements and statistics with packages like dendextend.
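
A minimal sketch of treating hc like any other hclust object, assuming the dbscan package is installed. The choice of k = 3 is arbitrary and only for illustration. Note that cutree assigns every point to a cluster and has no notion of noise.

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])
cl <- hdbscan(x, minPts = 5L)

# Cut the HDBSCAN hierarchy into a fixed number of groups
# (k = 3 is an illustrative value; cutree does not label noise points)
flat <- stats::cutree(cl$hc, k = 3)
table(flat)

# Convert to a dendrogram for use with dendrogram-based tools
dend <- stats::as.dendrogram(cl$hc)
class(dend)
```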

kbarylyuk commented 6 years ago

Hi @peekxc,

thank you very much for these suggestions. I'll explore these options and see if they allow me to do what I want.