Clustering hugely influenced by TCR[A/B/G/D] genes in SCTransform

raunakkar commented 11 months ago

Hi, I am using Seurat v4.1.3 currently. I had been using log normalization wherein I used to NormailzeData > FindVariableFeatures > ScaleData(vars.to.regress = c("percent.mt, percent.rb)) > RunPCA > FindNeighbors > FindClusters. With the advent of SCTransform with v2 regularisation, I have shifted to using that for the last year or so, with the pipeline SCTransorfm(vst.flavor = "v2", vars.to.regress = c("percent.mt, percent.rb, percent.trgenes")). I largely work with datasets coming from T cells only. What I am observing is, using the same dataset varied results using log transform and sctransform. Where in log normalization, I get PC genes not influenced by TR^ genes, in SCTransform, I get mainly TR^ genes in the top PCs. Hence, while in log transform my clustering is not influenced by TR^ genes, in SCTransform it is. The only solution I have found thus far is to remove said TR^ genes from dataset before calculating PCs and adding them back later.

I wanted to know if I am doing something wrong while performing SCTransform or there is any other work around for the same? I am very naive in this field so any kind of help would be hugely appreciated.

Thank you.

saketkc commented 11 months ago

Hi @raunakkar,

While this is expected (since the inherent model of variable selection is different). Once you run SCTransform, you can look at what is the residual variance for each of the genes (and why are they making it to the variable feature list):

> pbmc3k.final <- SCTransform(pbmc3k.final, vst.flavor="v2")
> fa <- pbmc3k.final@assays$SCT@SCTModel.list[[1]]@feature.attributes
> fa <- fa %>% dplyr::arrange(desc(residual_variance))
> head(fa)

It is not incorrect to exclude a set of genes that you know are not biologically interesting from the variable selected features (after running SCTransform and before running PCA). But I would be curious to learn what is causing certain genes to be variable and are they also the top ranked genes (or rank lower in the list and have residual variance close to 1).

Hope this helps!

raunakkar commented 11 months ago

Hi @saketkc, Thank you for the prompt response. It actually gives me a lot more confidence going ahead. Further more, I checked the residual variance list and I have 19 TRBV genes with residual variance ranging from 10-20 (out of 24 total genes in this range). The top residual variance however is 118. And there are 10 genes above the TR^ genes with residual variance ranging from 20-60. Any thoughts regarding this?

saketkc commented 11 months ago

It actually means those genes do have high residual variance (>1) and hence make it to the top variable features list.

satijalab / sctransform

Clustering hugely influenced by TCR[A/B/G/D] genes in SCTransform #175