vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0

Memory Error with Clustering with Leiden algorithm matrix - When to use matrix vs igraph method? #155

Closed WilliamMWei closed 6 months ago

WilliamMWei commented 8 months ago

Hi,

Thanks for the tool.

I attempted to cluster 45,000 cells with the Leiden algorithm using the default argument `method = "matrix"`, but encountered a `MemoryError`. When I changed it to `method = "igraph"`, it ran fine.

The help says to use the igraph method when we do not want to cast a large dataset to a dense matrix, so it seems to exist simply to handle large datasets. Would you mind letting me know whether there is any other key difference between the igraph and matrix methods in terms of the clustering results? And when should I choose one over the other?

Related post: https://github.com/scverse/scanpy/issues/1053

I have also posted here: https://github.com/satijalab/seurat/issues/7979

Thank you so much for your support!

 pbmc_cd4_cxcr5posneg.data_filtergene_filtercell_list_IndividualDatasetMERGED <- Seurat::FindClusters(pbmc_cd4_cxcr5posneg.data_filtergene_filtercell_list_IndividualDatasetMERGED, algorithm = 4, resolution = 1.2)
Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  MemoryError
Run `reticulate::py_last_error()` for details.
In addition: There were 12 warnings (use warnings() to see them)

> reticulate::py_last_error()

── Python Exception Message ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
MemoryError

── R Traceback ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     ▆
  1. ├─Seurat::FindClusters(...)
  2. └─Seurat:::FindClusters.Seurat(...)
  3.   ├─Seurat::FindClusters(...)
  4.   └─Seurat:::FindClusters.default(...)
  5.     └─Seurat:::RunLeiden(...)
  6.       ├─leiden::leiden(...)
  7.       └─leiden:::leiden.matrix(...)
  8.         ├─leiden:::make_py_graph(object, weights = weights)
  9.         └─leiden:::make_py_graph.matrix(object, weights = weights)
 10.           ├─leiden:::make_py_object(object, weights = weights)
 11.           └─leiden:::make_py_object.matrix(object, weights = weights)
 12.             └─adj_mat_py$tolist()
 13.               └─reticulate:::py_call_impl(callable, call_args$unnamed, call_args$named)
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustree_0.5.0     ggraph_2.1.0       ggplot2_3.4.4      reticulate_1.34.0  knitr_1.44         SeuratObject_4.1.4 Seurat_4.4.0      

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     rstudioapi_0.15.0      jsonlite_1.8.7         magrittr_2.0.3         spatstat.utils_3.0-3   farver_2.1.1          
  [7] rmarkdown_2.25         fs_1.6.3               vctrs_0.6.4            ROCR_1.0-11            memoise_2.0.1          spatstat.explore_3.2-5
 [13] rstatix_0.7.2          htmltools_0.5.6.1      usethis_2.2.2          broom_1.0.5            sctransform_0.4.1      parallelly_1.36.0     
 [19] KernSmooth_2.23-21     htmlwidgets_1.6.2      ica_1.0-3              plyr_1.8.9             plotly_4.10.3          zoo_1.8-12            
 [25] cachem_1.0.8           igraph_1.5.1           mime_0.12              lifecycle_1.0.3        pkgconfig_2.0.3        Matrix_1.6-1.1        
 [31] R6_2.5.1               fastmap_1.1.1          fitdistrplus_1.1-11    future_1.33.0          shiny_1.7.5.1          digest_0.6.33         
 [37] colorspace_2.1-0       patchwork_1.1.3        ps_1.7.5               rprojroot_2.0.3        tensor_1.5             irlba_2.3.5.1         
 [43] pkgload_1.3.3          ggpubr_0.6.0           labeling_0.4.3         progressr_0.14.0       fansi_1.0.5            spatstat.sparse_3.0-2 
 [49] httr_1.4.7             polyclip_1.10-6        abind_1.4-5            compiler_4.3.1         here_1.0.1             remotes_2.4.2.1       
 [55] withr_2.5.1            backports_1.4.1        viridis_0.6.4          carData_3.0-5          pkgbuild_1.4.2         ggforce_0.4.1         
 [61] ggsignif_0.6.4         MASS_7.3-60            rappdirs_0.3.3         sessioninfo_1.2.2      tools_4.3.1            lmtest_0.9-40         
 [67] httpuv_1.6.12          future.apply_1.11.0    goftest_1.2-3          glue_1.6.2             callr_3.7.3            nlme_3.1-162          
 [73] promises_1.2.1         grid_4.3.1             checkmate_2.2.0        Rtsne_0.16             cluster_2.1.4          reshape2_1.4.4        
 [79] generics_0.1.3         gtable_0.3.4           spatstat.data_3.0-3    tidyr_1.3.0            data.table_1.14.8      tidygraph_1.2.3       
 [85] sp_2.1-1               car_3.1-2              utf8_1.2.4             spatstat.geom_3.2-7    RcppAnnoy_0.0.21       ggrepel_0.9.4         
 [91] RANN_2.6.1             pillar_1.9.0           stringr_1.5.0          later_1.3.1            splines_4.3.1          tweenr_2.0.2          
 [97] dplyr_1.1.3            lattice_0.21-8         survival_3.5-5         deldir_1.0-9           tidyselect_1.2.0       miniUI_0.1.1.1        
[103] pbapply_1.7-2          gridExtra_2.3          scattermore_1.2        xfun_0.40              graphlayouts_1.0.1     devtools_2.4.5        
[109] matrixStats_1.0.0      stringi_1.7.12         lazyeval_0.2.2         yaml_2.3.7             evaluate_0.22          codetools_0.2-19      
[115] tibble_3.2.1           BiocManager_1.30.22    cli_3.6.1              uwot_0.1.16            xtable_1.8-4           munsell_0.5.0         
[121] processx_3.8.2         Rcpp_1.0.11            globals_0.16.2         spatstat.random_3.2-1  png_0.1-8              parallel_4.3.1        
[127] ellipsis_0.3.2         prettyunits_1.2.0      profvis_0.3.8          urlchecker_1.0.1       listenv_0.9.0          viridisLite_0.4.2     
[133] scales_1.2.1           ggridges_0.5.4         leiden_0.4.3           purrr_1.0.2            crayon_1.5.2           rlang_1.1.1           
[139] cowplot_1.1.1         


szhorvat commented 8 months ago

Using a matrix is not a feature of this library. It is entirely specific to the leiden R package, which will convert that matrix to a graph before doing any community detection.

Given what the leiden package does, the claim in Seurat's documentation that the "matrix" method is faster for small data seems rather strange ... maybe it has to do with inefficient transfer of data between R and Python.
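To illustrate the conversion point: a dense adjacency matrix and an edge list can describe exactly the same weighted graph, so the input format should not by itself change what gets clustered. The following is a minimal pure-Python sketch of that equivalence, assuming an undirected graph without self-loops. It is not the leiden R package's actual code, and the helper names `dense_to_edges` and `edges_to_dense` are hypothetical.

```python
def dense_to_edges(adj):
    """Return (i, j, weight) triples for the nonzero upper-triangle
    entries of a symmetric adjacency matrix given as a list of lists."""
    n = len(adj)
    return [
        (i, j, adj[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i][j] != 0
    ]


def edges_to_dense(n, edges):
    """Rebuild the symmetric n x n adjacency matrix from (i, j, weight) triples."""
    adj = [[0] * n for _ in range(n)]
    for i, j, w in edges:
        adj[i][j] = w
        adj[j][i] = w
    return adj


# Toy undirected graph on 4 vertices (weights are made up for illustration).
adj = [
    [0, 1, 0, 2],
    [1, 0, 3, 0],
    [0, 3, 0, 0],
    [2, 0, 0, 0],
]

edges = dense_to_edges(adj)                    # sparse description: 3 edges
round_trip = edges_to_dense(len(adj), edges)   # back to the dense form

print(edges)
print(round_trip == adj)
```

The edge list stores only the nonzero entries, which is why it stays small for sparse graphs, while the dense form always stores all n × n values.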

denvercal1234 commented 8 months ago

Thanks @szhorvat. Just so I understand correctly: did you mean that specifying `method = "matrix"` or `method = "igraph"` does not really affect the resulting clusters, and only affects how efficiently the data is processed before community detection? For example, with a large dataset, specifying `method = "igraph"` would skip the conversion of the data to a dense matrix, thereby speeding up the whole clustering (community detection).

szhorvat commented 8 months ago

Presumably yes. But you need to discuss this with the maintainers of the packages that implemented these methods. This choice of methods does not come from the leidenalg Python package.

denvercal1234 commented 8 months ago

Thanks Szabolcs. Hopefully someone from Seurat will give some input.

SamGG commented 8 months ago

Probably @TomKellyGenetics could shed some light on this.

My opinion: maybe it's time to use igraph directly. https://igraph.org/r/doc/cluster_leiden.html

TomKellyGenetics commented 8 months ago

> My opinion: maybe it's time to use igraph directly. https://igraph.org/r/doc/cluster_leiden.html

@SamGG the leiden package already does this by default for igraph objects, although fewer parameters are supported than when calling Python. This has been supported for over a year, since version 0.4.

> maybe it has to do with inefficient transfer of data between R and Python

@szhorvat that's correct, it does: reticulate supports dense matrices but not sparse matrices or igraph objects, so igraph objects are passed as an edge list and recreated in Python. This only applies to older versions of the R package, for the reasons discussed above, so the comment in the Seurat documentation is likely no longer relevant for users running igraph 1.2.7 and leiden 0.4.0 or later.
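For a rough sense of scale, here is a back-of-the-envelope estimate of why the dense route can run out of memory at 45,000 cells while an edge list does not. It is only a sketch under stated assumptions: 8-byte floats, 64-bit CPython (where each list entry is an 8-byte pointer to a separate float object, as would be produced by the `tolist()` call in the traceback above), and a hypothetical average of 20 edges per cell with 24 bytes per edge. The function names are made up for illustration.

```python
import sys


def dense_matrix_bytes(n_cells, bytes_per_value=8):
    """Memory for an n x n dense matrix of fixed-width values."""
    return n_cells * n_cells * bytes_per_value


def nested_list_bytes(n_cells, pointer_bytes=8):
    """Approximate memory for the same matrix as nested Python lists:
    one pointer per entry plus one float object per entry (assumption)."""
    per_entry = pointer_bytes + sys.getsizeof(1.0)
    return n_cells * n_cells * per_entry


def edge_list_bytes(n_cells, edges_per_cell, bytes_per_edge=24):
    """Memory for an edge list: two 8-byte indices and one 8-byte weight
    per edge (assumption), with a hypothetical average degree."""
    return n_cells * edges_per_cell * bytes_per_edge


n = 45_000  # number of cells reported in this issue
print(f"dense matrix : {dense_matrix_bytes(n) / 1e9:.1f} GB")
print(f"nested lists : {nested_list_bytes(n) / 1e9:.1f} GB")
print(f"edge list    : {edge_list_bytes(n, 20) / 1e6:.1f} MB")
```

Under these assumptions the dense matrix alone needs about 16.2 GB before any conversion to Python lists, whereas the edge list is on the order of tens of megabytes.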

SamGG commented 8 months ago

Thanks for this information and your feedback.

vtraag commented 6 months ago

Thanks all for commenting in my absence! I believe all questions are addressed, so I'm closing this.