satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.3k stars 917 forks source link

DESeq2 results not the same when run through FindMarkers #7764

Closed mzz92 closed 1 year ago

mzz92 commented 1 year ago

Discussed in https://github.com/satijalab/seurat/discussions/7763

Originally posted by **mzz92** September 5, 2023 I'm having issues comparing DESeq2 results when run through FindMarkers versus when DESeq2 is run on its own. (I know that Seurat is planning to drop DESeq2 support, but I'd still like to track down this discrepancy.) In the code below, the DESeq2 calls mirrors those in the DESeq2DETest function (https://github.com/satijalab/seurat/blob/master/R/differential_expression.R). The results when DESeq2 is run manually in `res` are different from results when run through `FindMarkers(pbmc_small, ident.1=1, ident.2=2, test.use="DESeq2", slot="counts")`. I'm particularly puzzled by the different log2FoldChange values that are reported. For example: ``` > res_pre["MS4A1",] log2 fold change (MLE): group Group2 vs Group1 Wald test p-value: group Group2 vs Group1 DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj MS4A1 1.86572 1.28329 0.385088 3.33247 0.000860783 0.0039596 > > findmarkers["MS4A1",] p_val avg_log2FC pct.1 pct.2 p_val_adj MS4A1 0.0008607826 -0.7615781 1 1 0.19798 ``` What am I doing wrong here? Thanks! ``` library(Seurat) library(DESeq2) data("pbmc_small") # add 1 so that DESeq2 will run pbmc_small@assays$RNA@counts <- as.matrix(pbmc_small@assays$RNA@counts) + 1 Idents(pbmc_small) <- pbmc_small$RNA_snn_res.1 pbmc_small <- subset(pbmc_small, subset= RNA_snn_res.1 != 0) cells.1 <- WhichCells(pbmc_small, idents = 1) cells.2 <- WhichCells(pbmc_small, idents = 2) group.info <- data.frame(row.names = colnames(pbmc_small)) group.info[cells.1, "group"] <- "Group1" group.info[cells.2, "group"] <- "Group2" group.info[, "group"] <- factor(x = group.info[, "group"]) group.info$wellKey <- rownames(x = group.info) dds1 <- DESeq2::DESeqDataSetFromMatrix( countData = pbmc_small@assays$RNA@counts, colData = group.info, design = ~ group ) dds1 <- DESeq2::estimateSizeFactors(object = dds1) dds1 <- DESeq2::estimateDispersions(object = dds1, fitType = "local") dds1 <- DESeq2::nbinomWaldTest(object = dds1) resultsNames(dds1) res_pre <- results(dds1, name = "group_Group2_vs_Group1", alpha = 0.05) res <- lfcShrink(dds1, coef = "group_Group2_vs_Group1", res=res_pre, type = "apeglm") findmarkers <- FindMarkers(pbmc_small, ident.1=1, ident.2=2, test.use="DESeq2", slot="counts") ``` My sessionInfo(): ``` R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] DESeq2_1.38.3 SummarizedExperiment_1.28.0 [3] Biobase_2.54.0 MatrixGenerics_1.6.0 [5] matrixStats_1.0.0 GenomicRanges_1.46.1 [7] GenomeInfoDb_1.30.1 IRanges_2.28.0 [9] S4Vectors_0.36.1 BiocGenerics_0.44.0 [11] SeuratObject_4.9.9.9091 Seurat_4.3.0.1 loaded via a namespace (and not attached): [1] spam_2.9-1 plyr_1.8.8 igraph_1.5.1 [4] lazyeval_0.2.2 sp_2.0-0 splines_4.1.2 [7] BiocParallel_1.32.5 listenv_0.9.0 scattermore_1.2 [10] ggplot2_3.4.2 digest_0.6.33 htmltools_0.5.6 [13] fansi_1.0.4 magrittr_2.0.3 memoise_2.0.1 [16] tensor_1.5 cluster_2.1.4 ROCR_1.0-11 [19] globals_0.16.2 Biostrings_2.62.0 annotate_1.72.0 [22] bdsmatrix_1.3-6 spatstat.sparse_3.0-2 colorspace_2.1-0 [25] blob_1.2.4 apeglm_1.16.0 ggrepel_0.9.3 [28] dplyr_1.1.2 crayon_1.5.2 RCurl_1.98-1.12 [31] jsonlite_1.8.7 progressr_0.13.0 spatstat.data_3.0-1 [34] survival_3.5-5 zoo_1.8-12 glue_1.6.2 [37] polyclip_1.10-4 gtable_0.3.3 zlibbioc_1.40.0 [40] XVector_0.34.0 leiden_0.4.3 DelayedArray_0.20.0 [43] future.apply_1.11.0 abind_1.4-5 scales_1.2.1 [46] mvtnorm_1.2-2 DBI_1.1.3 spatstat.random_3.1-5 [49] miniUI_0.1.1.1 Rcpp_1.0.11 viridisLite_0.4.2 [52] xtable_1.8-6 emdbook_1.3.13 reticulate_1.30 [55] bit_4.0.5 dotCall64_1.0-2 htmlwidgets_1.6.2 [58] httr_1.4.6 RColorBrewer_1.1-3 ellipsis_0.3.2 [61] ica_1.0-3 pkgconfig_2.0.3 XML_3.99-0.14 [64] uwot_0.1.16 deldir_1.0-9 locfit_1.5-9.8 [67] utf8_1.2.3 tidyselect_1.2.0 rlang_1.1.1 [70] reshape2_1.4.4 later_1.3.1 AnnotationDbi_1.56.2 [73] munsell_0.5.0 tools_4.1.2 cachem_1.0.8 [76] cli_3.6.1 generics_0.1.3 RSQLite_2.3.1 [79] ggridges_0.5.4 stringr_1.5.0 fastmap_1.1.1 [82] goftest_1.2-3 bit64_4.0.5 fitdistrplus_1.1-11 [85] purrr_1.0.1 RANN_2.6.1 KEGGREST_1.34.0 [88] pbapply_1.7-2 future_1.33.0 nlme_3.1-162 [91] mime_0.12 compiler_4.1.2 plotly_4.10.2 [94] png_0.1-8 spatstat.utils_3.0-3 tibble_3.2.1 [97] geneplotter_1.72.0 stringi_1.7.12 lattice_0.21-8 [100] Matrix_1.5-4.1 vctrs_0.6.3 pillar_1.9.0 [103] lifecycle_1.0.3 spatstat.geom_3.2-4 lmtest_0.9-40 [106] RcppAnnoy_0.0.21 data.table_1.14.8 cowplot_1.1.1 [109] bitops_1.0-7 irlba_2.3.5.1 httpuv_1.6.11 [112] patchwork_1.1.2 R6_2.5.1 promises_1.2.0.1 [115] KernSmooth_2.23-22 gridExtra_2.3 parallelly_1.36.0 [118] codetools_0.2-19 MASS_7.3-60 sctransform_0.3.5 [121] GenomeInfoDbData_1.2.7 parallel_4.1.2 grid_4.1.2 [124] tidyr_1.3.0 coda_0.19-4 Rtsne_0.16 [127] bbmle_1.0.25 spatstat.explore_3.2-1 numDeriv_2016.8-1.1 [130] shiny_1.7.4.1 ```
longmanz commented 1 year ago

Hi, Please see my comment at https://github.com/satijalab/seurat/discussions/7763 . You can see that P-values are actually the same from the 2 strategies. If you are worried about the log-fold-change: In DESeq2, they calculate the log-fc after the DESeq2 normalization (which normalizes the counts by using a size-factor estimated from all the genes) and a special shrinking strategy (to make the estimates of log-fc more robust when some genes do not have enough information). In Seurat, we simply calculate the log-fc by directly using data from the slot "data" (which is log-normalized) without other modifications. That is way you see the 2 log-fc are different.