satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
208 stars 33 forks

Workflow for SCTransform and merged samples #182

Closed Evenlyeven closed 7 months ago

Evenlyeven commented 7 months ago

Thank you for developing SCTransform! I do prefer it over log normalization in many cases. After searching the discussions in Issues, I still have some confusion about the recommended workflow for SCTransform and merged samples.

Most scRNA-Seq datasets I work with are sequenced in the same batch, and we seldom see batch effects across samples. As a result, we usually don't need to correct batch effects by integration; we simply merge the samples, then run SCTransform and the subsequent dimensional reduction steps. This is consistent with what @ChristophH suggested in issue #32.
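For concreteness, the merge-then-SCTransform route described above might look like the following Seurat sketch. The object names `s1` and `s2` and the `dims` choice are hypothetical, not the poster's actual settings:

```r
library(Seurat)

# s1, s2: two Seurat objects from the same sequencing batch (hypothetical names)
merged <- merge(s1, y = s2, add.cell.ids = c("s1", "s2"))

# One SCTransform model is fit across all cells of the merged object
merged <- SCTransform(merged)
merged <- RunPCA(merged)
merged <- RunUMAP(merged, dims = 1:30)
```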

However, I recently read that 'SCTransform learn the model of technical noise (within the experiment)'. I assume "experiment" here means "sample", since even within one sequencing run the sequencing depth can still vary, right? This makes me think that running SCTransform on individual samples makes more sense. But I am not sure whether scale.data values calculated from two different SCTransform models are comparable. According to here, they are not comparable, so I am unsure about this route (SCTransform on individual samples, then dimensional reduction using scale.data from different SCT models). But according to here, the results will be similar whether SCTransform is run separately or on the merged object. This seems to conflict with the beginning of this paragraph. (?)

I feel like the best way to work with samples without batch effects is: run SCTransform on each sample separately --> correct the count, data, and scale.data slots --> dimensional reduction using the corrected scale.data slot. However, it looks like both the merge function you provide for SCTAssay objects and the PrepSCTFindMarkers function correct the count and data slots of the SCT assay, but not the scale.data slot. I wonder if there is a reason why?

Please correct me if I said anything that is not accurate above.

Thank you very much for your time in advance, any input from you will be appreciated!

saketkc commented 7 months ago

Hi @Evenlyeven, if you have multiple samples, run SCTransform on all of them as one merged object, and do not observe any batch effect, you need not go through the integration + PrepSCTFindMarkers step. In that case, SCTransform learns one model for all the samples and uses the median sequencing depth of the entire dataset to estimate the corrected counts (stored in the counts slot of the SCT assay).

PrepSCTFindMarkers is designed for running DE when you have multiple samples and there is evidence of batch effect. This function works on the counts and data slots and recorrects the counts using the minimum of the median UMI across samples, using the individually learned models. See the SCT v2 paper for details, but very briefly, this results in the lowest FPR for a given FDR. The residuals from two models are not comparable (so you cannot run DE on the scale.data slot, for example), but if there is no batch effect I expect them to be comparable, which means you should be able to run downstream processing (clustering/UMAP) on the merged SCT assay.

Hope this helps!
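The per-sample route described above, ending with PrepSCTFindMarkers before DE, could be sketched as follows. `obj.list` and the cluster identities passed to FindMarkers are hypothetical placeholders:

```r
library(Seurat)

# obj.list: a list of Seurat objects, one per sample (hypothetical name).
# SCTransform is run per sample, so each sample gets its own noise model.
obj.list <- lapply(obj.list, SCTransform)

# Merge the SCT assays; counts/data are carried over per model
merged <- merge(obj.list[[1]], y = obj.list[-1])

# Recorrect counts using the minimum median UMI across samples
# so that DE is run on comparable corrected counts, not residuals
merged <- PrepSCTFindMarkers(merged)

# DE on the SCT assay; "A" and "B" stand in for real cluster idents
markers <- FindMarkers(merged, ident.1 = "A", ident.2 = "B", assay = "SCT")
```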

caodudu commented 6 months ago


Hi @saketkc, thanks for your patient suggestions. I have read several issues on applying SCTransform to multiple samples, so let me try to summarize: if samples are biologically heterogeneous or under different treatments, we should run SCTransform on the merged samples (#55), and if samples are technically noisy, we should run SCTransform on them separately (#6116). Is this a reasonable conclusion?

Best, Cao