satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.25k stars 904 forks source link

Different Clustering and uMAP results in V3 vs V4 #7767

Closed phorve closed 2 months ago

phorve commented 1 year ago

Hi all,

I recently inherited a project from someone else that includes sequencing data from larval zebrafish in multiple conditions. The original analysis was performed using Seurat V3.2.0 and I reran the analysis using V4.3.0.1 and found that the number of clusters was different and the uMAP projection had changed. Is this expected behavior? I have two main questions:

  1. Can you explain why there are varying results between the two versions?
  2. Should I "trust" one version over another?

Thank you in advance for your help! I've included the code used (below), the two differing uMAP projections (below) and linked the final .rds objects created by each script. Please let me know if there is any other information that would be helpful in answering!

library('Seurat')

# Read in the 10X Datasets
cv <- Read10X(data.dir = "./Talapas/phase3/cv/")
gf<- Read10X(data.dir = "./Talapas/phase3/gf/")
befa <- Read10X(data.dir = "./Talapas/phase3/befa/")

# Creates the seurat object with the raw (non-normalized data).
cv <- CreateSeuratObject(counts = cv, project = "cv", min.cells = 3, min.features = 200)
gf <- CreateSeuratObject(counts = gf, project = "gf", min.cells = 3, min.features = 200)
befa <- CreateSeuratObject(counts = befa, project = "befa", min.cells = 3, min.features = 200)

# Initiate treatments
cv@meta.data$treatment <- "cv"
gf@meta.data$treatment<- "gf"
befa@meta.data$treatment<- "befa"
#Initiates the tissuetype as "whole"
cv@meta.data$tissuetype <- "whole"
gf@meta.data$tissuetype <- "whole"
befa@meta.data$tissuetype <- "whole"
#Initiates the tissuetypeTreatment to be able to group dissected and treatment 
cv@meta.data$tissuetype_x_treatment <- "whole_cv"
gf@meta.data$tissuetype_x_treatment <- "whole_gf"
befa@meta.data$tissuetype_x_treatment <- "whole_befa"

# Reads in mito genes
mito_genes <- read.table("./Talapas/mito_genes.txt", header = TRUE, sep = "\t") 
# Calculates the intersect of subsets of the rownames in data and Gene.stable.ID in mito_genes
mito.genes.cv <- intersect(rownames(cv), mito_genes$Gene.stable.ID)
mito_genes.gf <- intersect(rownames(gf), mito_genes$Gene.stable.ID)
mito_genes.befa <- intersect(rownames(befa), mito_genes$Gene.stable.ID)
# sums the counts mito genes and divides by total gene expression per cell
percent.mito.cv <- Matrix::colSums(cv[mito.genes.cv, ])/Matrix::colSums(cv)
percent.mito.gf <- Matrix::colSums(gf[mito_genes.gf, ])/Matrix::colSums(gf)
percent.mito.befa <- Matrix::colSums(befa[mito_genes.befa, ])/Matrix::colSums(befa)
# stores the percent mito gene expresion in metadata
cv <- AddMetaData(object = cv, metadata = percent.mito.cv, col.name = "percent.mito")
gf <- AddMetaData(object = gf, metadata = percent.mito.gf, col.name = "percent.mito")
befa <- AddMetaData(object = befa, metadata = percent.mito.befa, col.name = "percent.mito")

#filter cells that have unique feature counts, removing unwanted cells from the dataset,
cv <- subset(cv, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)
gf <- subset(gf, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)
befa <- subset(befa, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)

options(future.globals.maxSize= 10891289600)
treatment.list <- c(cv, gf, befa)
#Employs a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, 
for (i in 1:length(treatment.list)) {
  treatment.list[[i]] <- NormalizeData(treatment.list[[i]], verbose = TRUE, normalization.method = "LogNormalize", scale.factor = 10000)
}
reference.list <- treatment.list
# Identifying matching cell pairs across datasets. "Anchors" represent a similar biological state, weighted based on the overlap in their nearest neighbors. This creates a reference to transfer data and metadata from one experiment to another.
# integrate three of the objects into a reference, including the dimensionality of the dataset
treatment.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:150, anchor.features = length(rownames(cv)), reduction = "cca")

# returns a Seurat object, which holds an integrated (batch-corrected) expression matrix for all cells, enabling them to be jointly analyzed.
all.treatments.integrated <- IntegrateData(anchorset = treatment.anchors, dims = 1:150)

# switch to integrated assay. The variable features of this assay are automatically set during IntegrateData
DefaultAssay(all.treatments.integrated) <- "integrated"

all.genes=rownames(all.treatments.integrated)

#plan("multiprocess")

#apply a linear transformation ('scaling') that is a standard pre-processing.
# This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
all.treatments.integrated <- ScaleData(all.treatments.integrated, features = all.genes, vars.to.regress = c("nCount_RNA", "percent.mito"), verbose = FALSE)

# Perform a principal components analysis (PCA) on cells, based on the expression data in a SingleCellExperiment object. 
# PCA is a technique used to emphasize variation and bring out strong patterns in a dataset.
all.treatments.integrated <- RunPCA(all.treatments.integrated, npcs = 150, verbose = FALSE, features = all.genes)

#refines the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity).
all.treatments.integrated <- FindNeighbors(all.treatments.integrated, dims = 1:30, features = all.genes, reduction = "pca")

# Clusters cells
# resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters.
all.treatments.integrated <- FindClusters(all.treatments.integrated, resolution = 3)

# uMAP
all.treatments.integrated<- RunUMAP(all.treatments.integrated, dims = 1:30, reduction = "pca")

# switch to integrated assay. The variable features of this assay are automatically set during IntegrateData
DefaultAssay(all.treatments.integrated) <- "RNA"

# Save the final data file 
saveRDS(all.treatments.integrated, "./August2023_CV_GF_BefA_150CCs_30uMAP_3Resolution_SeuratV3.rds")

# Session Info
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /gpfs/packages/R/3.6.2/lib64/R/lib/libRblas.so
LAPACK: /gpfs/packages/R/3.6.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Seurat_3.2.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2            tidyr_1.1.0           jsonlite_1.7.0       
 [4] viridisLite_0.3.0     splines_3.6.2         leiden_0.3.3         
 [7] shiny_1.5.0           ggrepel_0.8.2         globals_0.12.5       
[10] pillar_1.4.6          lattice_0.20-41       glue_1.4.1           
[13] reticulate_1.16       digest_0.6.25         polyclip_1.10-0      
[16] RColorBrewer_1.1-2    promises_1.1.1        colorspace_1.4-1     
[19] cowplot_1.0.0         htmltools_0.5.0       httpuv_1.5.4         
[22] Matrix_1.2-18         plyr_1.8.6            pkgconfig_2.0.3      
[25] listenv_0.8.0         purrr_0.3.4           xtable_1.8-4         
[28] patchwork_1.0.1       scales_1.1.1          RANN_2.6.1           
[31] tensor_1.5            later_1.1.0.1         Rtsne_0.15           
[34] spatstat.utils_1.17-0 tibble_3.0.3          mgcv_1.8-31          
[37] generics_0.0.2        ggplot2_3.3.2         ellipsis_0.3.1       
[40] ROCR_1.0-11           pbapply_1.4-2         lazyeval_0.2.2       
[43] deldir_0.1-28         survival_3.2-3        magrittr_1.5         
[46] crayon_1.3.4          mime_0.9              future_1.18.0        
[49] nlme_3.1-148          MASS_7.3-51.6         ica_1.0-2            
[52] tools_3.6.2           fitdistrplus_1.1-1    data.table_1.12.8    
[55] lifecycle_0.2.0       stringr_1.4.0         plotly_4.9.2.1       
[58] munsell_0.5.0         cluster_2.1.0         irlba_2.3.3          
[61] compiler_3.6.2        rsvd_1.0.3            rlang_0.4.7          
[64] grid_3.6.2            ggridges_0.5.2        goftest_1.2-2        
[67] RcppAnnoy_0.0.16      rappdirs_0.3.1        htmlwidgets_1.5.1    
[70] igraph_1.2.5          miniUI_0.1.1.1        gtable_0.3.0         
[73] codetools_0.2-16      abind_1.4-5           reshape2_1.4.4       
[76] R6_2.4.1              gridExtra_2.3         zoo_1.8-8            
[79] dplyr_1.0.0           uwot_0.1.8            fastmap_1.0.1        
[82] future.apply_1.6.0    KernSmooth_2.23-17    ape_5.4              
[85] spatstat.data_1.4-3   stringi_1.4.6         spatstat_1.64-1      
[88] parallel_3.6.2        Rcpp_1.0.5            rpart_4.1-15         
[91] vctrs_0.3.2           sctransform_0.2.1     png_0.1-7            
[94] tidyselect_1.1.0      lmtest_0.9-37   
library('Seurat')

# Read in the 10X Datasets
cv <- Read10X(data.dir = "./Talapas/phase3/cv/")
gf<- Read10X(data.dir = "./Talapas/phase3/gf/")
befa <- Read10X(data.dir = "./Talapas/phase3/befa/")

# Creates the seurat object with the raw (non-normalized data).
cv <- CreateSeuratObject(counts = cv, project = "cv", min.cells = 3, min.features = 200)
gf <- CreateSeuratObject(counts = gf, project = "gf", min.cells = 3, min.features = 200)
befa <- CreateSeuratObject(counts = befa, project = "befa", min.cells = 3, min.features = 200)

# Initiate treatments
cv@meta.data$treatment <- "cv"
gf@meta.data$treatment<- "gf"
befa@meta.data$treatment<- "befa"
#Initiates the tissuetype as "whole"
cv@meta.data$tissuetype <- "whole"
gf@meta.data$tissuetype <- "whole"
befa@meta.data$tissuetype <- "whole"
#Initiates the tissuetypeTreatment to be able to group dissected and treatment 
cv@meta.data$tissuetype_x_treatment <- "whole_cv"
gf@meta.data$tissuetype_x_treatment <- "whole_gf"
befa@meta.data$tissuetype_x_treatment <- "whole_befa"

# Reads in mito genes
mito_genes <- read.table("./Talapas/mito_genes.txt", header = TRUE, sep = "\t") 
# Calculates the intersect of subsets of the rownames in data and Gene.stable.ID in mito_genes
mito.genes.cv <- intersect(rownames(cv), mito_genes$Gene.stable.ID)
mito_genes.gf <- intersect(rownames(gf), mito_genes$Gene.stable.ID)
mito_genes.befa <- intersect(rownames(befa), mito_genes$Gene.stable.ID)
# sums the counts mito genes and divides by total gene expression per cell
percent.mito.cv <- Matrix::colSums(cv[mito.genes.cv, ])/Matrix::colSums(cv)
percent.mito.gf <- Matrix::colSums(gf[mito_genes.gf, ])/Matrix::colSums(gf)
percent.mito.befa <- Matrix::colSums(befa[mito_genes.befa, ])/Matrix::colSums(befa)
# stores the percent mito gene expresion in metadata
cv <- AddMetaData(object = cv, metadata = percent.mito.cv, col.name = "percent.mito")
gf <- AddMetaData(object = gf, metadata = percent.mito.gf, col.name = "percent.mito")
befa <- AddMetaData(object = befa, metadata = percent.mito.befa, col.name = "percent.mito")

#filter cells that have unique feature counts, removing unwanted cells from the dataset,
cv <- subset(cv, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)
gf <- subset(gf, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)
befa <- subset(befa, subset = nFeature_RNA < 3000 & nCount_RNA < 50000 & percent.mito < .20)

options(future.globals.maxSize= 10891289600)
treatment.list <- c(cv, gf, befa)
#Employs a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, 
for (i in 1:length(treatment.list)) {
  treatment.list[[i]] <- NormalizeData(treatment.list[[i]], verbose = TRUE, normalization.method = "LogNormalize", scale.factor = 10000)
}
reference.list <- treatment.list
# Identifying matching cell pairs across datasets. "Anchors" represent a similar biological state, weighted based on the overlap in their nearest neighbors. This creates a reference to transfer data and metadata from one experiment to another.
# integrate three of the objects into a reference, including the dimensionality of the dataset
treatment.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:150, anchor.features = length(rownames(cv)), reduction = "cca")

# returns a Seurat object, which holds an integrated (batch-corrected) expression matrix for all cells, enabling them to be jointly analyzed.
all.treatments.integrated <- IntegrateData(anchorset = treatment.anchors, dims = 1:150)

# switch to integrated assay. The variable features of this assay are automatically set during IntegrateData
DefaultAssay(all.treatments.integrated) <- "integrated"

all.genes=rownames(all.treatments.integrated)

#plan("multiprocess")

#apply a linear transformation ('scaling') that is a standard pre-processing.
# This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
all.treatments.integrated <- ScaleData(all.treatments.integrated, features = all.genes, vars.to.regress = c("nCount_RNA", "percent.mito"), verbose = FALSE)

# Perform a principal components analysis (PCA) on cells, based on the expression data in a SingleCellExperiment object. 
# PCA is a technique used to emphasize variation and bring out strong patterns in a dataset.
all.treatments.integrated <- RunPCA(all.treatments.integrated, npcs = 150, verbose = FALSE, features = all.genes)

#refines the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity).
all.treatments.integrated <- FindNeighbors(all.treatments.integrated, dims = 1:30, features = all.genes, reduction = "pca")

# Clusters cells
# resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters.
all.treatments.integrated <- FindClusters(all.treatments.integrated, resolution = 3)

# uMAP
all.treatments.integrated<- RunUMAP(all.treatments.integrated, dims = 1:30, reduction = "pca")

# switch to integrated assay. The variable features of this assay are automatically set during IntegrateData
DefaultAssay(all.treatments.integrated) <- "RNA"

# Save the final data file 
saveRDS(all.treatments.integrated, "./August2023_CV_GF_BefA_150CCs_30uMAP_3Resolution_SeuratV4.rds")

# Session Info
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SeuratObject_4.1.3 Seurat_4.3.0.1    

loaded via a namespace (and not attached):
  [1] deldir_1.0-9           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.1            magrittr_2.0.3         RcppAnnoy_0.0.21      
  [7] matrixStats_1.0.0      ggridges_0.5.4         compiler_4.3.1        
 [10] spatstat.geom_3.2-4    png_0.1-8              vctrs_0.6.3           
 [13] reshape2_1.4.4         stringr_1.5.0          pkgconfig_2.0.3       
 [16] fastmap_1.1.1          ellipsis_0.3.2         utf8_1.2.3            
 [19] promises_1.2.1         purrr_1.0.2            jsonlite_1.8.7        
 [22] goftest_1.2-3          later_1.3.1            spatstat.utils_3.0-3  
 [25] irlba_2.3.5.1          parallel_4.3.1         cluster_2.1.4         
 [28] R6_2.5.1               ica_1.0-3              stringi_1.7.12        
 [31] RColorBrewer_1.1-3     spatstat.data_3.0-1    reticulate_1.31       
 [34] parallelly_1.36.0      lmtest_0.9-40          scattermore_1.2       
 [37] Rcpp_1.0.11            tensor_1.5             future.apply_1.11.0   
 [40] zoo_1.8-12             sctransform_0.3.5      httpuv_1.6.11         
 [43] Matrix_1.5-4.1         splines_4.3.1          igraph_1.5.1          
 [46] tidyselect_1.2.0       rstudioapi_0.15.0      abind_1.4-5           
 [49] spatstat.random_3.1-5  codetools_0.2-19       miniUI_0.1.1.1        
 [52] spatstat.explore_3.2-1 listenv_0.9.0          lattice_0.21-8        
 [55] tibble_3.2.1           plyr_1.8.8             shiny_1.7.5           
 [58] ROCR_1.0-11            Rtsne_0.16             future_1.33.0         
 [61] survival_3.5-5         polyclip_1.10-4        fitdistrplus_1.1-11   
 [64] pillar_1.9.0           KernSmooth_2.23-21     plotly_4.10.2         
 [67] generics_0.1.3         sp_2.0-0               ggplot2_3.4.3         
 [70] munsell_0.5.0          scales_1.2.1           globals_0.16.2        
 [73] xtable_1.8-4           glue_1.6.2             lazyeval_0.2.2        
 [76] tools_4.3.1            data.table_1.14.8      RANN_2.6.1            
 [79] leiden_0.4.3           cowplot_1.1.1          grid_4.3.1            
 [82] tidyr_1.3.0            colorspace_2.1-0       nlme_3.1-162          
 [85] patchwork_1.1.3        cli_3.6.1              spatstat.sparse_3.0-2 
 [88] fansi_1.0.4            viridisLite_0.4.2      dplyr_1.1.2           
 [91] uwot_0.1.16            gtable_0.3.3           digest_0.6.33         
 [94] progressr_0.14.0       ggrepel_0.9.3          htmlwidgets_1.6.2     
 [97] htmltools_0.5.6        lifecycle_1.0.3        httr_1.4.7            
[100] mime_0.12              MASS_7.3-60           
Seurat V3.2.0 Seurat V4.3.0.1
uMAP_V3 uMAP_V4
dcollins15 commented 2 months ago

Thanks for using Seurat!

It appears that this issue has gone stale. In an effort to keep our Issues board from getting more unruly than it already is, we’re going to begin closing out issues that haven’t had any activity since the release of v4.4.0.

If this issue is still relevant we strongly encourage you to reopen or repost it, especially if you didn’t initially receive a response from us.