Scale or not to scale the genes by their variance in real datasets

spriyansh commented 1 year ago

Hi @kstreet13, I was following the Slingshot vignette. In Section 2.3, Dimensionality Reduction, you explain why it's not worth scaling the simulated data before computing PCs. I wonder whether this also applies to real datasets. In all the single-cell tutorials I have seen that deal with real datasets, the authors scale their data before computing the PCA. What is your opinion on that?

kstreet13 commented 1 year ago

I think we may be talking about two different things here, but I stand by the opinion expressed in the vignette, that you shouldn't scale genes before PCA.

Many people, myself included, do often "scale their data before computing the PCA" in the sense that they perform normalization. This transformation is often of the form y = log(1e4*x/N + 1) where x is the original count, N is the total count for that cell, and y is the normalized expression value. A scaling factor is certainly part of this transformation, but it's not what I'm referring to in that section.
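For illustration, here is a minimal Python sketch of the normalization described above. The function name log_normalize, the toy counts, and the use of the natural log are assumptions made for the example, not something taken from the vignette; it also assumes every cell has a nonzero total count.

```python
import math

def log_normalize(counts, scale_factor=1e4):
    """Apply y = log(scale_factor * x / N + 1) to each count.

    counts: list of cells, each a list of raw gene counts (x).
    N is the total count for that cell (assumed to be > 0).
    The natural log is an assumption; the formula above only says "log".
    """
    normalized = []
    for cell in counts:
        total = sum(cell)  # N: total count for this cell
        normalized.append(
            [math.log(scale_factor * x / total + 1) for x in cell]
        )
    return normalized

# Toy data (hypothetical): two cells, three genes.
# The second cell has twice the sequencing depth of the first.
counts = [
    [0, 5, 95],
    [10, 10, 180],
]
norm = log_normalize(counts)

# Gene 2 makes up 5% of the counts in both cells, so its normalized
# value is the same in both despite the difference in depth.
print(norm[0][1], norm[1][1])
```

The point of the sketch is that this step rescales each cell (column), which is a different operation from rescaling each gene (row) to unit variance.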

Rather, some people will additionally scale each row (gene) of their data when performing PCA, so that all genes have the same variance. This has the effect of dampening signal from important, highly variable genes, and amplifying the random variation in housekeeping genes and lowly-expressed genes. This is what the scale (or scale.) parameter does in most PCA implementations, and it makes sense in some contexts, but in scRNA-seq, I don't think it's reasonable to assume that all genes contain the same amount of information.
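A rough way to see this effect is to look at how much of the total variance each gene contributes before and after row scaling, since that total is what PCA partitions into components. The sketch below is not a PCA and uses simulated values; the gene names, group sizes, and noise levels are all hypothetical choices made for the example.

```python
import random
import statistics

random.seed(0)
n_cells = 200

# Hypothetical normalized expression values:
# one marker gene that differs between two groups of cells,
# plus nine housekeeping-like genes with only small random noise.
marker = [
    random.gauss(0.0 if i < n_cells // 2 else 3.0, 0.3)
    for i in range(n_cells)
]
housekeeping = [
    [random.gauss(2.0, 0.3) for _ in range(n_cells)] for _ in range(9)
]
genes = [marker] + housekeeping  # rows are genes, columns are cells

def variance_shares(rows):
    """Fraction of the total variance contributed by each gene (row)."""
    variances = [statistics.pvariance(r) for r in rows]
    total = sum(variances)
    return [v / total for v in variances]

def scale_rows(rows):
    """Center each gene and divide by its standard deviation."""
    scaled = []
    for r in rows:
        mean = statistics.mean(r)
        sd = statistics.pstdev(r)
        scaled.append([(v - mean) / sd for v in r])
    return scaled

unscaled_share = variance_shares(genes)
scaled_share = variance_shares(scale_rows(genes))

print("marker share, unscaled:", round(unscaled_share[0], 3))
print("marker share, scaled:  ", round(scaled_share[0], 3))
```

Without scaling, the marker gene accounts for most of the total variance. After scaling, every gene contributes exactly one tenth, so the nine noise-only genes together outweigh the one gene that carries the group structure.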

spriyansh commented 1 year ago

Thanks, @kstreet13, for the quick and detailed response.

kstreet13 / slingshot

Scale or not to scale the genes by their variance in real datasets #206