do.scale = FALSE leads to many subpopulations in UMAP

george-hall-ucl commented 2 years ago

Thanks for the great tool!

We are analysing a dataset generated using 10X Chromium gene expression + VDJ sequencing. We have noticed that if we create a UMAP using SCTransform with its default parameters then we end up with many subpopulations (i.e. a main cluster for each of the three largest samples in our data with distinct subpopulations surrounding it). However, these sub-populations do not appear if we use the NormalizeData -> FindVariableFeatures -> ScaleData workflow. Similarly, they do not appear if we use SCTransform with do_scale = TRUE. Therefore it appears that it is scaling the data that causes this issue. I have uploaded an Rmarkdown notebook to show what I am talking about: https://github.com/george-hall-ucl/different_outcome_from_scaling/blob/main/different_outcome_from_scaling.md.

We noticed that each of the subpopulations appears to be caused by one of seven genes (TRBV genes, related the to the T-cell receptor). If we try to regress out the effect of these genes by assigning a score for all seven genes using AddModuleScore() and then SCTransform with vars.to.regress = "trbv.score1" then the subpopulations remain. We see a similar output if we add a score for each of these seven genes individually and then attempt to regress out each in turn.

We don't understand why this is the case, and we were wondering if you could please help us to understand what is going on here? Does this small set of genes that are highly expressed in single subpopulations cause issues with SCTransform's modelling, or do you believe that it is giving the intended output? We have also observed that cell type classifications carried out using the ScaleData'd data appear to be incorrect, whereas those carried out using the SCTransformed data (with subpopulations) appear much more feasible, perhaps implying that the unscaled output from SCTransform is in some sense "more correct".

Any insights would be greatly appreciated!

Many thanks, George

saketkc commented 2 years ago

Thanks for detailing your observations in a report @george-hall-ucl. This generally indicates there are some "batch" effects in the data - is each "sample" in your data a single "donor" as well?

The reason some clusters are defining a separate cluster is because these genes have super high pearson residuals. We try to address this problem with th ( vst.flavor="v2") v2 model. I would suggest you try it, and if that does not help please feel free to email me the objects at schoudhary@nygenome.org. do.scale=TRUE brings the pearson residuals to same scale - which might be helping here (given some genes had super high pearson residuals to start off with), but in most cases can dampen the signal (depending on the signal to noise ratio).

george-hall-ucl commented 2 years ago

Hi Saket,

Thank you for the fast response! These samples are all from the same donor, so there is no batch effect (at least, no batch effect caused by donor - of course there could be one caused by something else in the sampling/sequencing process).

I've checked the Pearson residuals of the DE genes that are driving the subpopulations and, as you said, they are very close to the upper cutoff (which in our case is around 32). They have pretty much the largest residuals among all genes. We have now run SCTransform with vst.flavor="v2", but unfortunately this didn't resolve the issue (i.e. the subpopulations still appeared).

I just ran the NormalizeData->FindVariableFeatures->ScaleData workflow with do.scale=F in ScaleData and the subpopulations did not appear, so it really seems like this is specific to SCTransform.

Thank you for the offer of looking at the objects - I guess we'll play with this a bit more on our end and maybe send them over if we still don't fully work out what is going on.

Thanks, George

saketkc commented 2 years ago

You can also reduce the clip range: SCTransform(clip.range = c(-10,10)).

saketkc commented 2 years ago

Closing as this seems to have been solved. Please feel to reopen with any follow up questions.

satijalab / sctransform

do.scale = FALSE leads to many subpopulations in UMAP #132