satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.26k stars 908 forks source link

Why is percent.mt greater than 100? #8786

Closed zhang-HZAU closed 5 months ago

zhang-HZAU commented 5 months ago

I use three seurat objects to merge and then execute SCTransform, and use "SCT" assay to execute PercentageFeatureSet(s_merged, pattern = "^mt-").Then it was discovered that pencent.mt was greater than 100.

fig1

The code that generates the exception is as follows:

merge_list <- list(con, case2wk, case4wk)
s_merged <- merge(x = merge_list[[1]], y = merge_list[2:3]) %>%
  SCTransform(assay = "RNA", variable.features.n = 3000)
s_merged[["percent.mt"]] <- PercentageFeatureSet(s_merged, pattern = "^mt-")

merge_list as follows:

fig2

Debug:

s_merged <- readRDS(glue("{output_dir}/s_merged.rds"))

s_merged[["percent.mt"]] <- PercentageFeatureSet(s_merged, pattern = "^mt-")
debug_meta <- s_merged[[]][s_merged[[]][, "percent.mt"] > 100,]

test_sct_count <- LayerData(object = s_merged, assay = "SCT", 
                            layer = "counts")

test_count_colsum <- colSums(test_sct_count)
test_count_colsum[c("GGTGTCGAGCTCTATG-5_1", "GTCTACCAGTGCTCAT-5_1", "TTCAATCAGTCACGAG-5_1")]

features.layer <- grep(pattern = "^mt-", 
                       x = rownames(x = s_merged[["SCT"]]["counts"]), value = TRUE)

layer.sums <- colSums(x = s_merged[features.layer, 
                                   , drop = FALSE])
layer.sums[c("GGTGTCGAGCTCTATG-5_1", "GTCTACCAGTGCTCAT-5_1", "TTCAATCAGTCACGAG-5_1")]

layer.perc <- layer.sums/s_merged[[]][colnames(test_sct_count), 
                                    paste0("nCount_", "SCT")] * 100
layer.perc[c("GGTGTCGAGCTCTATG-5_1", "GTCTACCAGTGCTCAT-5_1", "TTCAATCAGTCACGAG-5_1")]

feature_count <- layer.sums[c("GGTGTCGAGCTCTATG-5_1", "GTCTACCAGTGCTCAT-5_1", "TTCAATCAGTCACGAG-5_1")]
all_count <- test_count_colsum[c("GGTGTCGAGCTCTATG-5_1", "GTCTACCAGTGCTCAT-5_1", "TTCAATCAGTCACGAG-5_1")]
feature_count/all_count * 100

During the debugging process, it was found that nCount_SCT obtained through colSums was inconsistent with the meta.data in Fig 1.Through the debug code I got the same exception percentage.mt as in meta.data in fig1.My understanding of the calculation process of pencent.mt should be feature_count/all_count * 100, and the result is also in line with expectations. The problem lies in the inconsistency of nCount_SCT.

fig3

fig4

fig5

My question is:

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 8

Matrix products: default
BLAS/LAPACK: /cluster/apps/anaconda3/2020.02/envs/R-4.3.0/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] clusterProfiler_4.10.0      jhuanglabGO_1.0.0          
 [3] ggsignif_0.6.4              pdftools_3.4.0             
 [5] scRNAtoolVis_0.0.7          future_1.33.1              
 [7] ggpubr_0.6.0                BiocParallel_1.36.0        
 [9] scDblFinder_1.16.0          SingleCellExperiment_1.24.0
[11] SingleR_2.4.1               SummarizedExperiment_1.32.0
[13] Biobase_2.62.0              GenomicRanges_1.54.1       
[15] GenomeInfoDb_1.38.6         IRanges_2.36.0             
[17] S4Vectors_0.40.2            BiocGenerics_0.48.1        
[19] MatrixGenerics_1.14.0       matrixStats_1.2.0          
[21] ggthemes_5.1.0              harmony_1.2.0              
[23] Rcpp_1.0.12                 jhuanglabscell_1.0.0       
[25] Seurat_5.0.1                SeuratObject_5.0.1         
[27] sp_2.1-3                    readxl_1.4.3               
[29] patchwork_1.2.0.9000        openxlsx_4.2.5.2           
[31] glue_1.7.0                  jhtools_1.0.0              
[33] pak_0.7.1                   devtools_2.4.5             
[35] usethis_2.2.3               rvcheck_0.2.1              
[37] lubridate_1.9.3             forcats_1.0.0              
[39] stringr_1.5.1               dplyr_1.1.4                
[41] purrr_1.0.2                 readr_2.1.5                
[43] tidyr_1.3.1                 tibble_3.2.1               
[45] ggplot2_3.5.0               tidyverse_2.0.0            
[47] fs_1.6.3                    wget_0.0.1                 

loaded via a namespace (and not attached):
  [1] igraph_2.0.2                  ica_1.0-3                    
  [3] plotly_4.10.4                 scater_1.30.1                
  [5] zlibbioc_1.48.0               tidyselect_1.2.0             
  [7] bit_4.0.5                     doParallel_1.0.17            
  [9] lattice_0.22-5                rjson_0.2.21                 
 [11] blob_1.2.4                    urlchecker_1.0.1             
 [13] S4Arrays_1.2.0                parallel_4.3.0               
 [15] png_0.1-8                     cli_3.6.2                    
 [17] ggplotify_0.1.2               askpass_1.2.0                
 [19] goftest_1.2-3                 BiocIO_1.12.0                
 [21] bluster_1.12.0                BiocNeighbors_1.20.2         
 [23] shadowtext_0.1.3              tximport_1.30.0              
 [25] uwot_0.1.16                   curl_5.2.0                   
 [27] tidytree_0.4.6                mime_0.12                    
 [29] leiden_0.4.3.1                coin_1.4-3                   
 [31] stringi_1.8.3                 backports_1.4.1              
 [33] rjags_4-15                    parallelDist_0.2.6           
 [35] XML_3.99-0.16.1               httpuv_1.6.14                
 [37] AnnotationDbi_1.64.1          magrittr_2.0.3               
 [39] rappdirs_0.3.3                splines_4.3.0                
 [41] ggraph_2.1.0                  org.Hs.eg.db_3.18.0          
 [43] corrplot_0.92                 sctransform_0.4.1            
 [45] ggbeeswarm_0.7.2              sessioninfo_1.2.2            
 [47] DBI_1.2.2                     genefilter_1.84.0            
 [49] withr_3.0.0                   enrichplot_1.22.0            
 [51] xgboost_1.7.7.1               lmtest_0.9-40                
 [53] infercnv_1.18.1               tidygraph_1.3.1              
 [55] sva_3.50.0                    formatR_1.14                 
 [57] rtracklayer_1.62.0            BiocManager_1.30.22          
 [59] htmlwidgets_1.6.4             biomaRt_2.58.2               
 [61] ggrepel_0.9.5                 SparseArray_1.2.4            
 [63] cellranger_1.1.0              annotate_1.80.0              
 [65] reticulate_1.35.0             zoo_1.8-12                   
 [67] XVector_0.42.0                timechange_0.3.0             
 [69] foreach_1.5.2                 fansi_1.0.6                  
 [71] caTools_1.18.2                grid_4.3.0                   
 [73] ggtree_3.10.0                 data.table_1.15.0            
 [75] R.oo_1.26.0                   RSpectra_0.16-1              
 [77] irlba_2.3.5.1                 gridGraphics_0.5-1           
 [79] fastDummies_1.7.3             ellipsis_0.3.2               
 [81] lazyeval_0.2.2                yaml_2.3.8                   
 [83] phyclust_0.1-34               survival_3.5-8               
 [85] scattermore_1.2               BiocVersion_3.18.1           
 [87] crayon_1.5.2                  RcppAnnoy_0.0.22             
 [89] RColorBrewer_1.1-3            progressr_0.14.0             
 [91] tweenr_2.0.2                  later_1.3.2                  
 [93] ggridges_0.5.6                codetools_0.2-19             
 [95] profvis_0.3.8                 KEGGREST_1.42.0              
 [97] Rtsne_0.17                    limma_3.58.1                 
 [99] Rsamtools_2.18.0              filelock_1.0.3               
[101] pkgconfig_2.0.3               xml2_1.3.6                   
[103] GenomicAlignments_1.38.2      aplot_0.2.2                  
[105] spatstat.sparse_3.0-3         ape_5.7-1                    
[107] viridisLite_0.4.2             xtable_1.8-4                 
[109] car_3.1-2                     fastcluster_1.2.6            
[111] plyr_1.8.9                    httr_1.4.7                   
[113] tools_4.3.0                   globals_0.16.2               
[115] pkgbuild_1.4.3                beeswarm_0.4.0               
[117] broom_1.0.5                   nlme_3.1-164                 
[119] hdf5r_1.3.9                   futile.logger_1.4.3          
[121] lambda.r_1.2.4                HDO.db_0.99.1                
[123] dbplyr_2.4.0                  digest_0.6.34                
[125] qpdf_1.3.2                    Matrix_1.6-5                 
[127] farver_2.1.1                  tzdb_0.4.0                   
[129] reshape2_1.4.4                yulab.utils_0.1.4            
[131] viridis_0.6.5                 cachem_1.0.8                 
[133] BiocFileCache_2.10.1          polyclip_1.10-6              
[135] generics_0.1.3                Biostrings_2.70.2            
[137] mvtnorm_1.2-4                 ggfortify_0.4.16             
[139] parallelly_1.37.0             pkgload_1.3.4                
[141] statmod_1.5.0                 RcppTOML_0.2.2               
[143] RcppHNSW_0.6.0                ScaledMatrix_1.10.0          
[145] carData_3.0-5                 pbapply_1.7-2                
[147] spam_2.10-0                   gson_0.1.0                   
[149] dqrng_0.3.2                   utf8_1.2.4                   
[151] graphlayouts_1.1.0            SeuratWrappers_0.3.0         
[153] gtools_3.9.5                  gridExtra_2.3                
[155] shiny_1.8.0                   ini_0.3.1                    
[157] GenomeInfoDbData_1.2.11       R.utils_2.12.3               
[159] RCurl_1.98-1.14               memoise_2.0.1                
[161] pheatmap_1.0.12               scales_1.3.0                 
[163] R.methodsS3_1.8.2             RANN_2.6.1                   
[165] spatstat.data_3.0-4           cluster_2.1.6                
[167] configr_0.3.5                 spatstat.utils_3.0-4         
[169] hms_1.1.3                     fitdistrplus_1.1-11          
[171] munsell_0.5.0                 cowplot_1.1.3                
[173] colorspace_2.1-0              rlang_1.1.3                  
[175] ggdendro_0.1.23               DelayedMatrixStats_1.24.0    
[177] sparseMatrixStats_1.14.0      dotCall64_1.1-1              
[179] ggforce_0.4.2                 scuttle_1.12.0               
[181] mgcv_1.9-1                    coda_0.19-4.1                
[183] TH.data_1.1-2                 remotes_2.4.2.1              
[185] iterators_1.0.14              modeltools_0.2-23            
[187] abind_1.4-5                   interactiveDisplayBase_1.40.0
[189] GOSemSim_2.28.1               libcoin_1.0-10               
[191] treeio_1.26.0                 ggsci_3.0.0                  
[193] futile.options_1.0.1          bitops_1.0-7                 
[195] promises_1.2.1                scatterpie_0.2.1             
[197] RSQLite_2.3.5                 qvalue_2.34.0                
[199] sandwich_3.1-0                fgsea_1.28.0                 
[201] DelayedArray_0.28.0           GO.db_3.18.0                 
[203] compiler_4.3.0                prettyunits_1.2.0            
[205] beachmat_2.18.1               listenv_0.9.1                
[207] AnnotationHub_3.10.0          edgeR_4.0.16                 
[209] BiocSingular_1.18.0           tensor_1.5                   
[211] MASS_7.3-60.0.1               progress_1.2.3               
[213] spatstat.random_3.2-2         R6_2.5.1                     
[215] fastmap_1.1.1                 multcomp_1.4-25              
[217] fastmatch_1.1-4               rstatix_0.7.2                
[219] vipor_0.4.7                   ROCR_1.0-11                  
[221] rsvd_1.0.5                    gtable_0.3.4                 
[223] KernSmooth_2.23-22            miniUI_0.1.1.1               
[225] deldir_2.0-2                  htmltools_0.5.7              
[227] RcppParallel_5.1.7            bit64_4.0.5                  
[229] spatstat.explore_3.2-6        lifecycle_1.0.4              
[231] zip_2.3.1                     restfulr_0.0.15              
[233] vctrs_0.6.5                   spatstat.geom_3.2-8          
[235] DOSE_3.28.2                   scran_1.30.2                 
[237] ggfun_0.1.4                   future.apply_1.11.1          
[239] pillar_1.9.0                  gplots_3.1.3.1               
[241] metapod_1.10.1                locfit_1.5-9.8               
[243] jsonlite_1.8.8                argparse_2.2.2   
mhkowalski commented 5 months ago

Hello,

You should likely use the RNA assay rather than the SCT assay to calculate percent.mt.

Regarding the inconsistency, could you please check if you get the same problem (percent.mt>100) if you run the same code on one of your objects separately? My hunch is that this has something due to running SCT on multi-layered objects.

mhkowalski commented 5 months ago

Also, it would be great if you could provide a reproducible example so we could look into this further. I wasn't able to reproduce this using SeuratData objects.

library(Seurat)
library(SeuratData)
data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)
data("pbmc3k")
pbmc3k <- UpdateSeuratObject(pbmc3k)
pbmcsca[["RNA"]] <- as(object = pbmcsca[["RNA"]], Class = "Assay5")
pbmc3k[["RNA"]] <- as(object = pbmc3k[["RNA"]], Class = "Assay5")
obj <- merge(pbmc3k, pbmcsca)
obj <- SCTransform(obj, variable.features.n = 3000)
sum(obj$nCount_SCT != colSums(obj[['SCT']]$counts))
[1] 0
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
sum(obj[["percent.mt"]]>100 )
> sum(obj[["percent.mt"]]>100 )
[1] 0
zhang-HZAU commented 5 months ago

Hello,

You should likely use the RNA assay rather than the SCT assay to calculate percent.mt.

Regarding the inconsistency, could you please check if you get the same problem (percent.mt>100) if you run the same code on one of your objects separately? My hunch is that this has something due to running SCT on multi-layered objects.

Thank you very much for your reply.

I'm so confused now, feel like I've encountered the proton in the Three-Body novel, hhh.

Because I'm re-executing with the same input data, code and environment. It is found that the percent.mt calculated using the "sct" assay this time is normal and is less than 100.

The data used this time are: GSE140812, GSE193265. At present, it is normal for percent.mt to use "RNA" assay calculation.

zhang-HZAU commented 5 months ago

I reviewed the entire operation and now I know what caused percent.mt to be greater than 100.

After executing SCTransform, in order to successfully execute FindAllMarkers, there will be an intermediate step s_merged <- PrepSCTFindMarkers(object = s_merged). PrepSCTFindMarkers will update the count value in the expression, but will not update nCount_SCT in meta.data, which leads to the inconsistency in nCount_SCT and percentage.mt is greater than 100.

My wrong execution order:

s_merged <- merge(x = merge_list[[1]], y = merge_list[2:3]) %>%
  SCTransform(assay = "RNA", variable.features.n = 3000)

s_merged <- PrepSCTFindMarkers(object = s_merged)

s_merged[["percent.mt"]] <- PercentageFeatureSet(s_merged, pattern = "^mt-", assay = "SCT")

correct order:

s_merged <- merge(x = merge_list[[1]], y = merge_list[2:3]) %>%
  SCTransform(assay = "RNA", variable.features.n = 3000)

s_merged[["percent.mt"]] <- PercentageFeatureSet(s_merged, pattern = "^mt-", assay = "SCT")

s_merged <- PrepSCTFindMarkers(object = s_merged)

I'm very sorry, personal reasons led to this error. thank you for your reply.