satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
203 stars 33 forks source link

Difference between sctransform::vst() and SCTransform()? #137

Closed micw42 closed 2 years ago

micw42 commented 2 years ago

Hello, I'm new to using SCTransform. I'm confused about the difference between sctransform::vst() and Seurat::SCTransform(). The documentation says that Seurat::SCTransform() calls sctransform::vst(), but they produce very different results.

df is a data frame containing raw counts with genes as rows and samples as columns. I used this code to normalize the data with sctransform::vst(): df1 = sctransform::vst(as.matrix(df))$y %>% as.data.frame() This is what df1 looks like:

image

I used this code to normalize the data with Seurat::SCTransform(): df2 = df %>% as.matrix() %>% CreateSeuratObject() %>% SCTransform() %>% GetAssayData() %>% as.data.frame() This is what df2 looks like:

image

Can you please explain why sctransform::vst() and Seurat::SCTransform() results are different? Also, which one do you recommend for very sparse scRNA-seq data (~90% of all raw counts are 0)? Thank you so much for your help!

saketkc commented 2 years ago

The major difference between SCTransform() and sctransform::vst() are in how residuals are clipped. vst() uses a default of (-sqrt(n), sqrt(n)) while SCTransform() uses a default of (-sqrt(n/30), sqrt(n/30)). Hope that explains the differences. SCtransform() calls vst() internally.

micw42 commented 2 years ago

Thank you for the explanation! I realized that I was using the log1p(counts) slot in the SCTransform() output, instead of the pearson residuals. The actual pearson residuals are very similar to the ones from sctransform::vst(). Do you recommend using log1p(counts) or the pearson residuals for building a regression model?

saketkc commented 2 years ago

Depends on the use case. For integration, we make use of pearson residuals. But for DE, I would recommend using the corrected counts (log1p(corrected counts) as in the data) slot.

micw42 commented 2 years ago

I see, thanks again. What are the benefits of using log1p(corrected counts) instead of the pearson residuals?

saketkc commented 2 years ago

For DE it results in reducing the false positives as we show in the benchmarks here. See this vignette for an application - corrected counts are comparable across experiments (after approrpriate median sequencing depth correction) while pearson residuals are not.