satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 915 forks source link

Using the graph.name argument generates clustering that is different when not using the graph.name argument? #4155

Closed Pancreas-Pratik closed 3 years ago

Pancreas-Pratik commented 3 years ago

Thank you Satija Lab for the incredible work that has been done with Seurat! You guys are miracle workers!

Here is the bug:

I am using the pbm3k dataset from SeuratData as the reproducible example.

I think there is a bug when using the graph.name argument. Please correct me if I'm wrong.

Simply using the graph.name argument generates clustering that is different when not using the graph.name argument. (either more or less clusters, more or less cells in clusters)

Similar clustering differences happen when using the default algorithm for FindClusters as well. Here I used algorithm 4.

I thought specifying a graph.name should keep everything consistent. I actually prefer the clustering generated with the graph.name argument included in my dataset. I'm wondering what is happening.

Here is what was done:

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
SeuratData::InstallData("pbmc3k")
pbmc <- SeuratData::LoadData("pbmc3k")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
#Without graph.name argument
#pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
#pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, algorithm = 4, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

Here is the full workflow for with the graph.name argument:

> pbmc <- SeuratData::LoadData("pbmc3k")
> # Initialize the Seurat object with the raw (non-normalized data).
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
  |============================================================================================| 100%
> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
       FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
       PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
       CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
       MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
       HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
       BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
       CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
       TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
       HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
       PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
       HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
       NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
       SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
       GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
       AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
       LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
PC_ 5 
Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
       GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
       RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
       PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
       FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB 
> pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
> pbmc <- RunUMAP(pbmc, dims = 1:10)
15:37:13 UMAP embedding parameters a = 0.9922 b = 1.112
15:37:13 Read 2638 rows and found 10 numeric columns
15:37:13 Using Annoy for neighbor search, n_neighbors = 30
15:37:13 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:14 Writing NN index file to temp file /tmp/Rtmp66O0pD/filee2d579073b5
15:37:14 Searching Annoy index using 1 thread, search_k = 3000
15:37:15 Annoy recall = 100%
15:37:15 Commencing smooth kNN distance calibration using 1 thread
15:37:16 Initializing from normalized Laplacian + noise
15:37:16 Commencing optimization for 500 epochs, with 105124 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:37:21 Optimization finished
> DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

with-graph-name

Here is the full workflow **without** `graph.name` argument:
> pbmc <- SeuratData::LoadData("pbmc3k")
> # Initialize the Seurat object with the raw (non-normalized data).
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
  |============================================================================================| 100%
> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
       FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
       PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
       CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
       MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
       HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
       BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
       CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
       TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
PC_ 3 
Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
       HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
       PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
       HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
       NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
PC_ 4 
Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
       SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
       GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
       AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
       LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
PC_ 5 
Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
       GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
       RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
       PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
       FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB 
> #Without graph.name argument
> #pbmc <- FindNeighbors(pbmc, graph.name ="test", dims = 1:10)
> #pbmc <- FindClusters(pbmc, graph.name = "test", algorithm = 4, resolution = 0.5)
> pbmc <- FindNeighbors(pbmc, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, algorithm = 4, resolution = 0.5)
> pbmc <- RunUMAP(pbmc, dims = 1:10)
15:38:52 UMAP embedding parameters a = 0.9922 b = 1.112
15:38:52 Read 2638 rows and found 10 numeric columns
15:38:52 Using Annoy for neighbor search, n_neighbors = 30
15:38:52 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:38:53 Writing NN index file to temp file /tmp/Rtmp66O0pD/filee2d55ca00c5
15:38:53 Searching Annoy index using 1 thread, search_k = 3000
15:38:54 Annoy recall = 100%
15:38:54 Commencing smooth kNN distance calibration using 1 thread
15:38:55 Initializing from normalized Laplacian + noise
15:38:55 Commencing optimization for 500 epochs, with 105124 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:39:00 Optimization finished
> DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 6)

without-graph-name

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pbmc3k.SeuratData_3.1.4 patchwork_1.1.1         SeuratObject_4.0.0      Seurat_4.0.0           
[5] dplyr_1.0.4            

loaded via a namespace (and not attached):
  [1] nlme_3.1-152         matrixStats_0.58.0   RcppAnnoy_0.0.18     RColorBrewer_1.1-2  
  [5] httr_1.4.2           sctransform_0.3.2    tools_4.0.4          utf8_1.1.4          
  [9] R6_2.5.0             irlba_2.3.3          rpart_4.1-15         KernSmooth_2.23-18  
 [13] uwot_0.1.10          mgcv_1.8-33          DBI_1.1.1            lazyeval_0.2.2      
 [17] colorspace_2.0-0     withr_2.4.1          tidyselect_1.1.0     gridExtra_2.3       
 [21] compiler_4.0.4       cli_2.3.1            plotly_4.9.3         labeling_0.4.2      
 [25] scales_1.1.1         lmtest_0.9-38        spatstat.data_2.0-0  ggridges_0.5.3      
 [29] pbapply_1.4-3        rappdirs_0.3.3       spatstat_1.64-1      goftest_1.2-2       
 [33] stringr_1.4.0        digest_0.6.27        spatstat.utils_2.0-0 pkgconfig_2.0.3     
 [37] htmltools_0.5.1.1    parallelly_1.23.0    fastmap_1.1.0        htmlwidgets_1.5.3   
 [41] rlang_0.4.10         rstudioapi_0.13      shiny_1.6.0          farver_2.0.3        
 [45] generics_0.1.0       zoo_1.8-8            jsonlite_1.7.2       ica_1.0-2           
 [49] magrittr_2.0.1       Matrix_1.3-2         Rcpp_1.0.6           munsell_0.5.0       
 [53] fansi_0.4.2          abind_1.4-5          reticulate_1.18      lifecycle_1.0.0     
 [57] stringi_1.5.3        MASS_7.3-53.1        Rtsne_0.15           plyr_1.8.6          
 [61] grid_4.0.4           parallel_4.0.4       listenv_0.8.0        promises_1.2.0.1    
 [65] ggrepel_0.9.1        crayon_1.4.1         miniUI_0.1.1.1       deldir_0.2-10       
 [69] lattice_0.20-41      cowplot_1.1.1        splines_4.0.4        tensor_1.5          
 [73] pillar_1.5.0         igraph_1.2.6         future.apply_1.7.0   reshape2_1.4.4      
 [77] codetools_0.2-18     leiden_0.3.7         glue_1.4.2           data.table_1.14.0   
 [81] BiocManager_1.30.10  vctrs_0.3.6          png_0.1-7            httpuv_1.5.5        
 [85] gtable_0.3.0         RANN_2.6.1           purrr_0.3.4          polyclip_1.10-0     
 [89] tidyr_1.1.2          SeuratData_0.2.1     scattermore_0.7      future_1.21.0       
 [93] assertthat_0.2.1     ggplot2_3.3.3        xfun_0.21            mime_0.10           
 [97] xtable_1.8-4         RSpectra_0.16-0      later_1.1.0.1        survival_3.2-7      
[101] viridisLite_0.3.0    tibble_3.1.0         tinytex_0.29         cluster_2.1.1       
[105] globals_0.14.0       fitdistrplus_1.1-3   ellipsis_0.3.1       ROCR_1.0-11   
timoast commented 3 years ago

Thanks for reporting! This happens because when you provide only one name to the graph.name argument, only the nearest-neighbor graph is stored, and not the shared-nearest-neighbor (SNN) graph. By default (when no graph name is given) the clustering algorithm uses the SNN graph, which is why you're getting different results with/without the parameter here.

The graph.name parameter was not well documented, so I've updated it to more fully explain the behaviour of the parameter and explain that you should provide a vector of two names, one for the NN graph and one for the SNN graph.

If you think the clustering results are better using the NN rather than SNN graph, you can certainly use that. You might also find that increasing the clustering resolution when using the SNN graph might give a similar clustering result (separate that cluster 9 from 6)

Pancreas-Pratik commented 3 years ago

Thank you very much for helping me understand, @timoast! I really appreciate the Satija Lab's effort! It must be challenging to keep up with everything and everyone, but you guys are great at it! Hope you never stop! :1st_place_medal: