satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 914 forks source link

Discrepancy between "constant resolution" and same value from "sequential resolution" at FindClusters() function #6788

Closed rekren closed 1 year ago

rekren commented 1 year ago

Hi there,

While comparing the cluster assignments of FindClusters( ..., resolution = 0.5) and the seurat_object$RNA_snn_res.0.5 obtained from FindClusters( ..., resolution = seq(0.1,1,0.1)) I noticed something weird. These cluster assignments are not overlapping. Even though the random seed is equally set.

A: Sequential determination of clusters for resolution 0.1 to 1, with the increment of 0.1

seurat_object <- seurat_object %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData()  %>% RunPCA()
seurat_object <- RunHarmony(object = seurat_object,group.by.vars = "id")
seurat_object <- FindNeighbors(seurat_object,reduction = "harmony",dims = 1:20,force.recalc = T,assay = "RNA")
seurat_object <- FindClusters(object = seurat_object,algorithm = "leiden",method = "igraph",resolution = seq(0.1,1,0.1),random.seed = 2206,verbose = T )

B: Determination of clusters for resolution 0.5, directly.

seurat_object <- seurat_object %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData()  %>% RunPCA()
seurat_object <- RunHarmony(object = seurat_object,group.by.vars = "id")
seurat_object <- FindNeighbors(seurat_object,reduction = "harmony",dims = 1:20,force.recalc = T,assay = "RNA")
seurat_object <- FindClusters(object = seurat_object,algorithm = "leiden",method = "igraph",resolution= 0.5,random.seed = 2206,verbose = T )
#resolution = seq(0.1,1,0.1)

Here is a visual comparison of how the cells are assigned differently, between directly determining clusters for resolution = 0.5 and then calling the 0.5 assignments from a serial of cluster assignments. image

 sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggalluvial_0.12.3  forcats_0.5.2      stringr_1.5.0      dplyr_1.0.10       purrr_0.3.5        readr_2.1.3        tidyr_1.2.1       
 [8] tibble_3.1.8       tidyverse_1.3.2    clustree_0.5.0     ggraph_2.1.0       ggplot2_3.4.0      harmony_0.1.1      Rcpp_1.0.9        
[15] SeuratObject_4.1.3 Seurat_4.3.0       qs_0.25.4         

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                     spatstat.explore_3.0-5         reticulate_1.26-9000           tidyselect_1.2.0              
  [5] htmlwidgets_1.5.4              grid_4.2.2                     Rtsne_0.16                     munsell_0.5.0                 
  [9] codetools_0.2-18               ica_1.0-3                      future_1.29.0                  miniUI_0.1.1.1                
 [13] withr_2.5.0                    spatstat.random_3.0-1          colorspace_2.0-3               progressr_0.12.0              
 [17] Biobase_2.58.0                 rstudioapi_0.14                stats4_4.2.2                   SingleCellExperiment_1.20.0   
 [21] ROCR_1.0-11                    ggsignif_0.6.4                 tensor_1.5                     bmcite.SeuratData_0.3.0       
 [25] listenv_0.8.0                  MatrixGenerics_1.10.0          labeling_0.4.2                 GenomeInfoDbData_1.2.9        
 [29] polyclip_1.10-4                farver_2.1.1                   pheatmap_1.0.12                rprojroot_2.0.3               
 [33] parallelly_1.33.0              vctrs_0.5.1                    generics_0.1.3                 dittoSeq_1.9.3                
 [37] timechange_0.1.1               R6_2.5.1                       GenomeInfoDb_1.34.3            graphlayouts_0.8.4            
 [41] ggbeeswarm_0.6.0               bitops_1.0-7                   spatstat.utils_3.0-1           DelayedArray_0.24.0           
 [45] assertthat_0.2.1               promises_1.2.0.1               scales_1.2.1                   googlesheets4_1.0.1           
 [49] beeswarm_0.4.0                 gtable_0.3.1                   globals_0.16.2                 goftest_1.2-3                 
 [53] tidygraph_1.2.2                rlang_1.0.6                    GlobalOptions_0.1.2            splines_4.2.2                 
 [57] rstatix_0.7.1                  lazyeval_0.2.2                 gargle_1.2.1                   checkmate_2.1.0               
 [61] spatstat.geom_3.0-3            broom_1.0.1                    modelr_0.1.10                  reshape2_1.4.4                
 [65] abind_1.4-5                    backports_1.4.1                httpuv_1.6.7                   SCpubr_1.0.4                  
 [69] tools_4.2.2                    SeuratData_0.2.2               ellipsis_0.3.2                 RColorBrewer_1.1-3            
 [73] bonemarrowref.SeuratData_1.0.0 BiocGenerics_0.44.0            ggridges_0.5.4                 plyr_1.8.8                    
 [77] zlibbioc_1.44.0                RCurl_1.98-1.9                 ggpubr_0.5.0                   deldir_1.0-6                  
 [81] pbapply_1.6-0                  viridis_0.6.2                  cowplot_1.1.1                  S4Vectors_0.36.0              
 [85] zoo_1.8-11                     haven_2.5.1                    SummarizedExperiment_1.28.0    ggrepel_0.9.2                 
 [89] cluster_2.1.4                  fs_1.5.2                       here_1.0.1                     magrittr_2.0.3                
 [93] data.table_1.14.6              scattermore_0.8                circlize_0.4.15                reprex_2.0.2                  
 [97] lmtest_0.9-40                  RANN_2.6.1                     googledrive_2.0.0              fitdistrplus_1.1-8            
[101] matrixStats_0.63.0             stringfish_0.15.7              hms_1.1.2                      patchwork_1.1.2               
[105] mime_0.12                      xtable_1.8-4                   readxl_1.4.1                   IRanges_2.32.0                
[109] gridExtra_2.3                  shape_1.4.6                    compiler_4.2.2                 colorway_0.2.0                
[113] crayon_1.5.2                   KernSmooth_2.23-20             htmltools_0.5.4                tzdb_0.3.0                    
[117] later_1.3.0                    ggprism_1.0.4                  scCustomize_0.7.0              RcppParallel_5.1.5            
[121] lubridate_1.9.0                DBI_1.1.3                      RApiSerialize_0.1.2            tweenr_2.0.2                  
[125] dbplyr_2.2.1                   rappdirs_0.3.3                 MASS_7.3-58.1                  Matrix_1.5-3                  
[129] car_3.1-1                      cli_3.4.1                      parallel_4.2.2                 igraph_1.3.5                  
[133] GenomicRanges_1.50.1           pkgconfig_2.0.3                sp_1.5-1                       plotly_4.10.1                 
[137] spatstat.sparse_3.0-0          xml2_1.3.3                     paletteer_1.5.0                vipor_0.4.5                   
[141] XVector_0.38.0                 rvest_1.0.3                    snakecase_0.11.0               digest_0.6.31                 
[145] sctransform_0.3.5              RcppAnnoy_0.0.20               janitor_2.1.0                  spatstat.data_3.0-0           
[149] cellranger_1.1.0               leiden_0.4.3                   uwot_0.1.14                    pbmcref.SeuratData_1.0.0      
[153] shiny_1.7.3                    lifecycle_1.0.3                nlme_3.1-161                   jsonlite_1.8.4                
[157] carData_3.0-5                  viridisLite_0.4.1              fansi_1.0.3                    pillar_1.8.1                  
[161] lattice_0.20-45                fastmap_1.1.0                  httr_1.4.4                     survival_3.4-0                
[165] glue_1.6.2                     png_0.1-8                      ggforce_0.4.1                  stringi_1.7.8                 
[169] rematch2_2.1.2                 renv_0.16.0                    irlba_2.3.5.1                  future.apply_1.10.0         
samuel-marsh commented 1 year ago

Hi,

Not member of dev team but curious. Does this effect occur when using louvain algorithm for clustering or just when using Leiden?

Best, Sam

rekren commented 1 year ago

Hi Samuel,

Thanks for your interest. To address your curiosity, I also run;

FindClusters(object = seurat_object, algorithm =1,resolution= 0.5, random.seed = 2206,verbose = T ) 

and

FindClusters(object = seurat_object, algorithm =1,resolution= seq(0.1,1,0.1), random.seed = 2206,verbose = T )

For the resolution 0.5, the results are not overlapping, again.

P.S: For default Louvain parameters, cluster names start from "0" if you give constant resolution, but if it's a sequence of values, cluster names start from "1".

You can try with your data, too.image

samuel-marsh commented 1 year ago

Hi,

So just quick look at things myself and at least with basic attempt I can't seem to replicate the issue. Can you provide full code for reprex with example data? Below is what I ran and every time the results of identical() were TRUE.

Best, Sam

pbmc <- pbmc3k.SeuratData::pbmc3k

pbmc <- pbmc %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData()  %>% RunPCA() %>% FindNeighbors()

pbmc_val <- pbmc
pbmc_val_igraph <- pbmc
pbmc_seq <- pbmc
pbmc_seq_igraph <- pbmc

pbmc_val <- FindClusters(object = pbmc_val, resolution = 0.5)
pbmc_seq <- FindClusters(object = pbmc_seq, resolution = seq(0.1, 1,0.1))

identical(pbmc_val@meta.data$RNA_snn_res.0.5, pbmc_seq@meta.data$RNA_snn_res.0.5)

pbmc_val_igraph <- FindClusters(object = pbmc_val, resolution = 0.5, method = "igraph")
pbmc_seq_igraph <- FindClusters(object = pbmc_seq, resolution = seq(0.1, 1,0.1), method = "igraph")

identical(pbmc_val_igraph@meta.data$RNA_snn_res.0.5, pbmc_seq_igraph@meta.data$RNA_snn_res.0.5)

# Using harmony
pbmc_har <- pbmc3k.SeuratData::pbmc3k

pbmc_har$id <- sample(c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"), size = ncol(pbmc_har),
                         replace = TRUE)

pbmc_har <- pbmc_har %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData()  %>% RunPCA()

pbmc_har <- harmony::RunHarmony(object = pbmc_har, group.by.vars = "id")

pbmc_har <- FindNeighbors(pbmc_har,reduction = "harmony",dims = 1:20, ,assay = "RNA", force.recalc = T)

pbmc_har_val <- pbmc_har
pbmc_har_seq <- pbmc_har

pbmc_har_val <- FindClusters(object = pbmc_har_val, resolution = 0.5)
pbmc_har_seq <- FindClusters(object = pbmc_har_seq, resolution = seq(0.1, 1,0.1))

identical(pbmc_har_val@meta.data$RNA_snn_res.0.5, pbmc_har_seq@meta.data$RNA_snn_res.0.5)
rekren commented 1 year ago

Hello,

Thanks for the more concise, reproducible example.

I have also run this example and obtained the output of identical(pbmc_seq,pbmc_val) was TRUE. However, In my real dataset, the output of identical() was FALSE.

The example dataset "pbmc3k" has 3k cells, my real dataset has 35k cells. Maybe the cell count of the Seurat object being greater than 30k(?), somehow, affects this outcome. So, I run both clustering approaches to the "hcabm40k" dataset containing 40k cells in it.

SeuratData::InstallData("hcabm40k.SeuratData")
hcabm40k <- hcabm40k.SeuratData::hcabm40k

The output of identical(hcabm40k_seq,hcabm40k_val) was again TRUE for the bigger dataset. I was quite puzzled.

Finally, after meticulous searches, a bunch of reruns, blood, and tears; I noticed I was comparing older iterations of the analysis file with the newer "practice". Bad practice of saving non-time-stamped analysis files.

Thanks for your support in this endeavor of troubleshooting this "bug" (!)

Best, Rüçhan.