satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
208 stars 33 forks source link

prepSCTFindMarkers data correction increased the fraction of zero values. #129

Closed yamajackr closed 2 years ago

yamajackr commented 2 years ago

Hi Christoph, I appreciate all your work. I integrated two datasets using RPCA-based method after scTransform normalization. One dataset is UMI-based data, and the other is data obtained by Smart-seq2. For Smart-seq2 data, I changed the data distribution using a quantile normalization method (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02078-0). I read the paper for scTransform v2, and tried DE analysis using prepSCTFindMarkers function. I noticed that the fraction of zero values had increased a lot for some genes after prepSCTFindMarkers(). Are these expected things? Shouldn't I care about it? Many thanks in advance.

p1 = ggplot(enframe(tmp@assays$RNA@counts[Gene,]), aes(value))+geom_histogram()+ggtitle('RNA_counts')
p2 = ggplot(enframe(tmp@assays$SCT@counts[Gene,]), aes(value))+geom_histogram()+ggtitle('SCT_counts')
p3 = ggplot(enframe(tmp@assays$SCT@data[Gene,]), aes(value))+geom_histogram()+ggtitle('SCT_data')
tmp_prep = PrepSCTFindMarkers(tmp, assay = 'SCT')
p4 = ggplot(enframe(tmp_prep@assays$SCT@counts[Gene,]), aes(value))+geom_histogram()+ggtitle('post-correction_SCT_count')
p5 = ggplot(enframe(tmp_prep@assays$SCT@data[Gene,]), aes(value))+geom_histogram()+ggtitle('post-correction_SCT_data')
p1+p2+p3+p4+p5

question_plot question_plot2

saketkc commented 2 years ago

Hi @yamamotoryo, this is expected since the correction step uses median sequencing depth for generating counts (effectively downdepths, genes that are lowly expressed overall will tend to have more zeroes (this is independent of PrepSCTFindMarkers which will minimum of the median sequencing depth across models to generate corrected counts). But as we show in the paper, using corrected counts results in overall higher sensitivity & specificity.

library(Seurat)
library(SeuratData)
data("pbmc3k")
pbmc3k <- SCTransform(pbmc3k)
> sum(GetAssayData(pbmc3k, assay = "RNA", slot="counts")["CD4",]==0)
[1] 2386
> sum(GetAssayData(pbmc3k, assay = "SCT", slot="counts")["CD4",]==0)
[1] 2390
yamajackr commented 2 years ago

Thank you so much @saketkc !