satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
203 stars 33 forks source link

When Sctransform should not be used? #104

Open ElyasMo opened 3 years ago

ElyasMo commented 3 years ago

Dear ChristophH, Do you have a tip when we should not proceed with SCTransform. I have several spatial transcriptomics dataset and it has been a few weeks that I am trying to normalize my data through SCTransform. I just figured out that this method does not lead to a correct clustering of the spots while simple normalization works much better after regressing out the ratio of mitochondrial genes. So, I was wondering why SCTransform does not work well for my data while it is a powerful and highly recommended method of normalization.

One more thing is that, I have followed exactly the same workflow and codes in windows and mac and, surprisingly, I received slightly different clusters. How is it possible?

I am just adding my codes which is based on the standard tutorials and I have not added any new things.

Thank you very much in advance.


library(Seurat) 
library(sctransform)

#####Reading in the 10xVisium objects

setwd("D:/Poland/PHD/spatial/Second_set/HE_sample/")
Directory="D:/Poland/PHD/spatial/Second_set/HE_sample/"
HE_sample_2<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                             assay = "Spatial",
                             slice = "HEsmpl2",
                             filter.matrix = TRUE,
                             to.upper = FALSE)

setwd("D:/Poland/PHD/spatial/Second_set/HE_control/")
Directory="D:/Poland/PHD/spatial/Second_set/HE_control/"
HE_control_2<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                              assay = "Spatial",
                              slice = "HEctr2",
                              filter.matrix = TRUE,
                              to.upper = FALSE)

setwd("D:/Poland/PHD/spatial/project/HE/HE_sample_2248/")
Directory="D:/Poland/PHD/spatial/project/HE/HE_sample_2248/"
HE_sample_1<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                             assay = "Spatial",
                             slice = "HEsmpl1",
                             filter.matrix = TRUE,
                             to.upper = FALSE)

setwd("D:/Poland/PHD/spatial/project/HE/HE_control_2339/")
Directory="D:/Poland/PHD/spatial/project/HE/HE_control_2339/"
HE_control_1<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                              assay = "Spatial",
                              slice = "HEctr1",
                              filter.matrix = TRUE,
                              to.upper = FALSE)
#####integration
setwd("D:/Poland/PHD/spatial/project/CR/CR_sample_2248/")
Directory="D:/Poland/PHD/spatial/project/CR/CR_sample_2248/"
CR_sample_1<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                             assay = "Spatial",
                             slice = "CRsmpl1",
                             filter.matrix = TRUE,
                             to.upper = FALSE)

setwd("D:/Poland/PHD/spatial/project/CR/CR_control_2339/")
Directory="D:/Poland/PHD/spatial/project/CR/CR_control_2339/"
CR_control_1<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                              assay = "Spatial",
                              slice = "CRctr1",
                              filter.matrix = TRUE,
                              to.upper = FALSE)
setwd("D:/Poland/PHD/spatial/Second_set/CR_sample/")
Directory="D:/Poland/PHD/spatial/Second_set/CR_sample/"
CR_sample_2<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                             assay = "Spatial",
                             slice = "CRsmpl2",
                             filter.matrix = TRUE,
                             to.upper = FALSE)
setwd("D:/Poland/PHD/spatial/Second_set/CR_control/")
Directory="D:/Poland/PHD/spatial/Second_set/CR_control/"
CR_control_2<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5",
                              assay = "Spatial",
                              slice = "CRctr2",
                              filter.matrix = TRUE,
                              to.upper = FALSE)

##adding the percentage of MT genes to metadata
CR_control_1[["percent.mt"]] <- PercentageFeatureSet(CR_control_1, pattern = "^MT-")
CR_sample_1[["percent.mt"]] <- PercentageFeatureSet(CR_sample_1, pattern = "^MT-")
CR_control_2[["percent.mt"]] <- PercentageFeatureSet(CR_control_2, pattern = "^MT-")
CR_sample_2[["percent.mt"]] <- PercentageFeatureSet(CR_sample_2, pattern = "^MT-")
HE_control_1[["percent.mt"]] <- PercentageFeatureSet(HE_control_1, pattern = "^MT-")
HE_sample_1[["percent.mt"]] <- PercentageFeatureSet(HE_sample_1, pattern = "^MT-")
HE_control_2[["percent.mt"]] <- PercentageFeatureSet(HE_control_2, pattern = "^MT-")
HE_sample_2[["percent.mt"]] <- PercentageFeatureSet(HE_sample_2, pattern = "^MT-")

##Filtering the metadata basd on various criteria
CR_control_1 <- subset(CR_control_1, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
CR_sample_1 <- subset(CR_sample_1, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15 )
CR_control_2 <- subset(CR_control_2, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
CR_sample_2 <- subset(CR_sample_2, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15 )
HE_control_1 <- subset(HE_control_1, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
HE_sample_1 <- subset(HE_sample_1, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
HE_control_2 <- subset(HE_control_2, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
HE_sample_2 <- subset(HE_sample_2, subset = nFeature_Spatial > 200 & nFeature_Spatial < 7000 & percent.mt < 15)
HE_sample_1<-SCTransform(HE_sample_1, assay = "Spatial", verbose = FALSE)
HE_sample_1<- FindVariableFeatures(HE_sample_1, selection.method = "vst", 
                                    nfeatures = 3000, verbose = FALSE)

HE_control_1<-SCTransform(HE_control_1, assay = "Spatial", verbose = FALSE)
HE_control_1<- FindVariableFeatures(HE_control_1, selection.method = "vst", 
                                     nfeatures = 3000, verbose = FALSE)
HE_sample_2<-SCTransform(HE_sample_2, assay = "Spatial", verbose = FALSE)
HE_sample_2<- FindVariableFeatures(HE_sample_2, selection.method = "vst", 
                                    nfeatures = 3000, verbose = FALSE)

HE_control_2<-SCTransform(HE_control_2, assay = "Spatial", verbose = FALSE)
HE_control_2<- FindVariableFeatures(HE_control_2, selection.method = "vst", 
                                     nfeatures = 3000, verbose = FALSE)

CR_sample_1 <- SCTransform(CR_sample_1, assay = "Spatial", verbose = FALSE)
CR_sample_1<- FindVariableFeatures(CR_sample_1, selection.method = "vst", 
                                    nfeatures = 3000, verbose = FALSE)

CR_control_1<-SCTransform(CR_control_1, assay = "Spatial", verbose = FALSE)
CR_control_1<- FindVariableFeatures(CR_control_1, selection.method = "vst", 
                                     nfeatures = 3000, verbose = FALSE)

CR_sample_2 <- SCTransform(CR_sample_2, assay = "Spatial", verbose = FALSE)
CR_sample_2<- FindVariableFeatures(CR_sample_2, selection.method = "vst", 
                                    nfeatures = 3000, verbose = FALSE)

CR_control_2<-SCTransform(CR_control_2, assay = "Spatial", verbose = FALSE)
CR_control_2<- FindVariableFeatures(CR_control_2, selection.method = "vst", 
                                     nfeatures = 3000, verbose = FALSE)

HE_sample_1 <- RunPCA(HE_sample_1, verbose = FALSE)
HE_sample_1 <- RunUMAP(HE_sample_1, dims = 1:30, verbose = FALSE)
HE_sample_1 <- FindNeighbors(HE_sample_1, dims = 1:30, verbose = FALSE)
HE_sample_1 <- FindClusters(HE_sample_1, verbose = FALSE,resolution = 0.3) 

HE_control_1 <- RunPCA(HE_control_1, verbose = FALSE)
HE_control_1 <- RunUMAP(HE_control_1, dims = 1:30, verbose = FALSE)
HE_control_1 <- FindNeighbors(HE_control_1, dims = 1:30, verbose = FALSE)
HE_control_1 <- FindClusters(HE_control_1, verbose = FALSE, resolution = 0.3)

HE_sample_2 <- RunPCA(HE_sample_2, verbose = FALSE)
HE_sample_2 <- RunUMAP(HE_sample_2, dims = 1:30, verbose = FALSE)
HE_sample_2 <- FindNeighbors(HE_sample_2, dims = 1:30, verbose = FALSE)
HE_sample_2 <- FindClusters(HE_sample_2, verbose = FALSE,resolution = 0.3) 

HE_control_2 <- RunPCA(HE_control_2, verbose = FALSE)
HE_control_2 <- RunUMAP(HE_control_2, dims = 1:30, verbose = FALSE)
HE_control_2 <- FindNeighbors(HE_control_2, dims = 1:30, verbose = FALSE)
HE_control_2 <- FindClusters(HE_control_2, verbose = FALSE, resolution = 0.3)

CR_sample_1 <- RunPCA(CR_sample_1, verbose = FALSE)
CR_sample_1 <- RunUMAP(CR_sample_1, dims = 1:30, verbose = FALSE)
CR_sample_1 <- FindNeighbors(CR_sample_1, dims = 1:30, verbose = FALSE)
CR_sample_1 <- FindClusters(CR_sample_1, verbose = FALSE, resolution = 0.3)

CR_control_1 <- RunPCA(CR_control_1, verbose = FALSE)
CR_control_1 <- RunUMAP(CR_control_1, dims = 1:30, verbose = FALSE)
CR_control_1 <- FindNeighbors(CR_control_1, dims = 1:30, verbose = FALSE)
CR_control_1 <- FindClusters(CR_control_1, verbose = FALSE, resolution = 0.3) 

CR_sample_2 <- RunPCA(CR_sample_2, verbose = FALSE)
CR_sample_2 <- RunUMAP(CR_sample_2, dims = 1:30, verbose = FALSE)
CR_sample_2 <- FindNeighbors(CR_sample_2, dims = 1:30, verbose = FALSE)
CR_sample_2 <- FindClusters(CR_sample_2, verbose = FALSE, resolution = 0.3)

CR_control_2 <- RunPCA(CR_control_2, verbose = FALSE)
CR_control_2 <- RunUMAP(CR_control_2, dims = 1:30, verbose = FALSE)
CR_control_2 <- FindNeighbors(CR_control_2, dims = 1:30, verbose = FALSE)
CR_control_2 <- FindClusters(CR_control_2, verbose = FALSE, resolution = 0.3) 
DimPlot(HE_sample_1, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(HE_control_1, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(HE_sample_2, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(HE_control_2, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(CR_sample_1, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(CR_control_1, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(CR_sample_2, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()
DimPlot(CR_control_2, reduction = "umap", label = TRUE, pt.size = 1.3) + NoLegend()

SpatialDimPlot(HE_sample_1, label = FALSE, label.size =2)
SpatialDimPlot(HE_control_1, label = FALSE, label.size =2)
SpatialDimPlot(HE_sample_2, label = FALSE, label.size =2)
SpatialDimPlot(HE_control_2, label = FALSE, label.size =2)
SpatialDimPlot(CR_sample_1, label = FALSE, label.size =2)
SpatialDimPlot(CR_control_1, label = FALSE, label.size =2)
SpatialDimPlot(CR_sample_2, label = FALSE, label.size =2)
SpatialDimPlot(CR_control_2, label = FALSE, label.size =2)

Here is the information from my datasets after pre-processing/filtration:

An object of class Seurat 
36601 features across 3091 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2470 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2328 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2707 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 3328 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2439 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2266 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

An object of class Seurat 
36601 features across 2754 samples within 1 assay 
Active assay: Spatial (36601 features, 0 variable features)

And here is my sessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.1.1      RCurl_1.98-1.2     forcats_0.5.1      stringr_1.4.0      dplyr_1.0.5        purrr_0.3.4       
 [7] readr_1.4.0        tidyr_1.1.3        tibble_3.1.0       ggplot2_3.3.3      tidyverse_1.3.0    sctransform_0.3.2 
[13] SeuratObject_4.0.0 Seurat_4.0.0      

loaded via a namespace (and not attached):
  [1] Rtsne_0.15           colorspace_2.0-0     deldir_0.2-10        ellipsis_0.3.1       ggridges_0.5.3       fs_1.5.0            
  [7] rstudioapi_0.13      spatstat.data_2.0-0  leiden_0.3.7         listenv_0.8.0        farver_2.1.0         ggrepel_0.9.1       
 [13] bit64_4.0.5          lubridate_1.7.10     RSpectra_0.16-0      fansi_0.4.2          xml2_1.3.2           codetools_0.2-16    
 [19] splines_4.0.2        knitr_1.31           polyclip_1.10-0      jsonlite_1.7.2       broom_0.7.5          ica_1.0-2           
 [25] dbplyr_2.1.0         cluster_2.1.1        png_0.1-7            uwot_0.1.10          shiny_1.6.0          compiler_4.0.2      
 [31] httr_1.4.2           backports_1.2.1      assertthat_0.2.1     Matrix_1.2-18        fastmap_1.1.0        lazyeval_0.2.2      
 [37] cli_2.3.1            limma_3.44.3         later_1.1.0.1        htmltools_0.5.1.1    tools_4.0.2          igraph_1.2.6        
 [43] gtable_0.3.0         glue_1.4.2           RANN_2.6.1           reshape2_1.4.4       Rcpp_1.0.6           spatstat_1.64-1     
 [49] scattermore_0.7      cellranger_1.1.0     vctrs_0.3.6          nlme_3.1-148         lmtest_0.9-38        xfun_0.21           
 [55] globals_0.14.0       rvest_1.0.0          mime_0.10            miniUI_0.1.1.1       lifecycle_1.0.0      irlba_2.3.3         
 [61] goftest_1.2-2        future_1.21.0        MASS_7.3-51.6        zoo_1.8-8            scales_1.1.1         hms_1.0.0           
 [67] promises_1.1.1       spatstat.utils_2.0-0 parallel_4.0.2       RColorBrewer_1.1-2   yaml_2.2.1           reticulate_1.18     
 [73] pbapply_1.4-3        gridExtra_2.3        rpart_4.1-15         stringi_1.5.3        bitops_1.0-6         rlang_0.4.10        
 [79] pkgconfig_2.0.3      matrixStats_0.58.0   evaluate_0.14        lattice_0.20-41      ROCR_1.0-11          tensor_1.5          
 [85] patchwork_1.1.1      htmlwidgets_1.5.3    labeling_0.4.2       bit_4.0.4            tidyselect_1.1.0     parallelly_1.23.0   
 [91] RcppAnnoy_0.0.18     plyr_1.8.6           magrittr_2.0.1       R6_2.5.0             generics_0.1.0       DBI_1.1.1           
 [97] withr_2.4.1          haven_2.3.1          pillar_1.5.1         mgcv_1.8-31          fitdistrplus_1.1-3   survival_3.1-12     
[103] abind_1.4-5          future.apply_1.7.0   modelr_0.1.8         crayon_1.4.1         hdf5r_1.3.3          KernSmooth_2.23-17  
[109] utf8_1.2.1           plotly_4.9.3         rmarkdown_2.7        readxl_1.3.1         grid_4.0.2           data.table_1.14.0   
[115] reprex_1.0.0         digest_0.6.27        xtable_1.8-4         httpuv_1.5.4         munsell_0.5.0        viridisLite_0.3.0
ChristophH commented 3 years ago

Hi,

I cannot comment on your specific results without actually seeing them and understanding why you think sctransform normalization does not lead "to a correct clustering of the spots".

But I notice that you run FindVariableFeatures after SCTransform. That step is not necessary (and might be a disadvantage) since SCTransform already finds and sets the variable features.

jiangpuxuan commented 2 years ago

I met the very simmilar problem with you. My dataset have many doublets, and the heterotype doublets formed a long 'bridge' between clusters. The sctransform pipeline cannot tell 'bridge' from normal clusters, but the seurat intrinstic 'Normalize' worked well.

oghzzang commented 11 months ago

Dear @ElyasMo

Why didn't you use SCT integration?