satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0

Default clipping value: sqrt(n_cells/30) or sqrt(n_cells)? #127

Closed: dkobak closed this issue 2 years ago

dkobak commented 2 years ago

The default clipping range for residuals in Seurat::SCTransform appears to be sqrt(n_cells/30) (see https://satijalab.org/seurat/reference/sctransform), not sqrt(n_cells) as in sctransform::vst and as described in the Methods section of Hafemeister & Satija 2019 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1).

As a consequence, running Seurat::SCTransform on the pbmc33k dataset used in the original paper does not reproduce the paper's residuals, and the difference is large: the set of genes selected as most variable differs substantially from the one shown in Figure 4C.
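To illustrate why the clipping threshold can reorder the most variable genes, here is a minimal Python sketch (not the sctransform or Seurat implementation). It assumes clipping is symmetric, to the range [-c, c], with c = sqrt(n_cells) or c = sqrt(n_cells/30). The cell count, the two genes, and their residual values are made up for illustration, and `clip_bound` and `clip` are hypothetical helper names.

```python
import math
from statistics import pvariance


def clip_bound(n_cells, divisor=1):
    """Clipping threshold c = sqrt(n_cells / divisor)."""
    return math.sqrt(n_cells / divisor)


def clip(residuals, bound):
    """Clip each residual to the symmetric range [-bound, bound]."""
    return [max(-bound, min(bound, r)) for r in residuals]


# Hypothetical toy data: 900 cells, two genes.
n_cells = 900
# gene_a: zero residual almost everywhere, very large residual in 3 cells.
gene_a = [40.0] * 3 + [0.0] * (n_cells - 3)
# gene_b: moderate residuals in every cell.
gene_b = [1.2, -1.2] * (n_cells // 2)

bound_vst = clip_bound(n_cells)         # sqrt(n_cells)
bound_seurat = clip_bound(n_cells, 30)  # sqrt(n_cells / 30)

# Residual variance per gene under each clipping threshold.
var_vst = {
    "gene_a": pvariance(clip(gene_a, bound_vst)),
    "gene_b": pvariance(clip(gene_b, bound_vst)),
}
var_seurat = {
    "gene_a": pvariance(clip(gene_a, bound_seurat)),
    "gene_b": pvariance(clip(gene_b, bound_seurat)),
}

print("sqrt(n_cells):   ", bound_vst, var_vst)
print("sqrt(n_cells/30):", bound_seurat, var_seurat)
```

In this toy setup the two thresholds differ by a factor of sqrt(30), roughly 5.5. The gene driven by a handful of extreme cells ranks above the broadly variable gene under the looser threshold and below it under the stricter one, which is the kind of reordering described above.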

What is the preferred default value of clipping (and why)? Was the default at some point changed from sqrt(n_cells) to sqrt(n_cells/30)?

dkobak commented 2 years ago

Okay, I see now that Rahul has already answered this in https://github.com/satijalab/seurat/issues/2414:

How much to clip is an empirical determination. When originally writing vst, we used a simple default of sqrt(N). As we tested more datasets in Seurat, we felt it was helpful to impose a more stringent cutoff.

I am closing this issue.