mhahsler / dbscan

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
GNU General Public License v3.0

Possible Memory Leak #46

Closed: mlinegar closed this issue 2 years ago

mlinegar commented 3 years ago

I have been running into a segfault when running hdbscan. I initially hit the error when using the doc2vec library, which calls hdbscan. The error only occurs on my full dataset (137,649 rows, ~300 MB), not on a subset. It still happens even if I increase minPts or increase the size of the server I am using (I have tried up to 600 GB of RAM).

Is there any way around this error? Please let me know if there's anything I can do to help debug!

library(doc2vec)
# download sample file - note: file is ~300mb
utils::download.file("https://www.dropbox.com/s/geer73bjp936gaw/gdelt_seg_d2v.bin?dl=1", "temp.bin")
d2v <- read.paragraph2vec(file = "temp.bin")
emb <- as.matrix(d2v)
embedding_umap <- uwot::tumap(emb, n_neighbors = 100L, n_components = 2, metric = "cosine")
thisfails <- dbscan::hdbscan(embedding_umap, minPts = 25)

Here is the output of sessionInfo():

Matrix products: default
BLAS: /software/free/R/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/free/R/R-4.0.0/lib/R/lib/libRlapack.so

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ranger_0.12.1 vctrs_0.3.7 rlang_0.4.10
[4] mosaicCore_0.9.0 yardstick_0.0.8 workflowsets_0.0.2
[7] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3
[10] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[19] dials_0.0.9 scales_1.1.1 broom_0.7.6
[22] tidymodels_0.1.3 lubridate_1.7.10 gsubfn_0.7
[25] proto_1.0.0 data.table_1.13.6 dbscan_1.1-8
[28] uwot_0.1.10 Matrix_1.3-2 stringr_1.4.0
[31] doc2vec_0.2.0 futile.logger_1.4.3

loaded via a namespace (and not attached):
[1] splines_4.0.0 foreach_1.5.1 here_0.1
[4] prodlim_2019.11.13 assertthat_0.2.1 conflicted_1.0.4
[7] GPfit_1.0-8 globals_0.14.0 ipred_0.9-11
[10] pillar_1.6.0 backports_1.2.0 lattice_0.20-41
[13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
[16] pryr_0.1.4 hardhat_0.1.5 colorspace_2.0-0
[19] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3
[22] lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[25] RSpectra_0.16-0 gower_0.2.2 lava_1.6.9
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] furrr_0.2.2 nnet_7.3-14 cli_2.4.0
[34] survival_3.2-7 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 ps_1.4.0 fansi_0.4.1
[40] future_1.21.0 parallelly_1.24.0 MASS_7.3-53
[43] class_7.3-17 tools_4.0.0 formatR_1.7
[46] lifecycle_1.0.0 munsell_0.5.0 lambda.r_1.2.4
[49] compiler_4.0.0 grid_4.0.0 rstudioapi_0.13
[52] iterators_1.0.13 RcppAnnoy_0.0.18 gtable_0.3.0
[55] codetools_0.2-18 DBI_1.1.0 R6_2.5.0
[58] utf8_1.1.4 rprojroot_1.3-2 futile.options_1.0.1
[61] stringi_1.5.3 parallel_4.0.0 Rcpp_1.0.6
[64] rpart_4.1-15 tidyselect_1.1.0
mhahsler commented 3 years ago

Hi, thank you for your report. hdbscan calculates a full distance matrix with n^2 entries. For your data, the required memory in GB is:

137649^2 * 8 / 2^30
[1] 141.168

Maybe your R process is not allowed to allocate that much?
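For reference, a minimal sketch of that calculation in R (the variable names here are only for illustration; n is the row count from the report above):

n <- 137649                # rows in the embedding from the report
est_gb <- n^2 * 8 / 2^30   # n^2 double-precision (8-byte) entries, in GiB
est_gb                     # about 141 GB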

mhahsler commented 2 years ago

Update: the hdbscan code has been updated to work with long vectors. The development version on GitHub might address your problem.
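One way to try the development version from GitHub (a sketch, assuming the remotes package is installed; the thread itself does not show the install step):

# install the development version of dbscan from GitHub (assumes remotes is available)
remotes::install_github("mhahsler/dbscan")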