satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0

Correct workflow to perform SCTransform v2 on separate or merged dataset #161

Closed: reneemoerkens closed this issue 1 year ago

reneemoerkens commented 1 year ago

Hey,

Thanks for the great work!

I am struggling to find the correct workflow for my dataset with regard to SCTransform, specifically whether to perform it on my merged dataset or on the individual datasets, as specified in the vignette (https://satijalab.org/seurat/articles/sctransform_v2_vignette.html).

A bit more detail about my experiment: I have single-cell RNA sequencing data from intestinal cells: 5 stimulation conditions for 3 cell lines = 15 samples. These 15 samples were multiplexed in a randomized way over 5 lanes in the 10x Genomics chip, giving 5 libraries after preparation (each lane/library containing a multiplexed sample of the 3 cell lines with a randomized combination of stimulation conditions). The 5 libraries were equimolarly pooled for sequencing. All experiments and library preparation were performed at the same time by the same individuals.

In my data I see several (intended and unintended) sources of variation: the lanes in the 10x Genomics chip, the cell lines, and the stimulation conditions.

In the Seurat SCTransform v2 vignette I see that the Seurat object is split based on stimulation (IFN, CTRL), SCTransform is performed on the separate datasets, and then integration is performed, since the stimulation conditions end up in different positions in the UMAP. Since I have these three sources of variation to account for, I was wondering which of the following would be the preferred workflow:

[Image: diagram of the candidate workflows]

Basically, should I perform SCTransform v2 on the separate datasets or on the complete merged dataset? And if I should split, based on which source of variation should I split the data (and should this be done before or after the reciprocal PCA analysis to correct for cell line)?

Thank you very much for your help!

saketkc commented 1 year ago

Hi @reneemoerkens, I would recommend running SCTransform on a per cell line + stimulation basis. Rather than running PrepSCTFindMarkers() later, you can also pass scale_factor=<VALUE> (which is faster), where <VALUE> would be the minimum of the median UMI across the datasets. Your integration step will essentially integrate the stimulation/control conditions across cell lines. Hope this helps.
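Something along these lines (a rough sketch, not tested on your data; `seurat_obj` and the metadata column `cellline_stim` are placeholder names for your merged object and a column combining cell line and stimulation):

```r
library(Seurat)

# split by the assumed cell line + stimulation metadata column
obj.list <- SplitObject(seurat_obj, split.by = "cellline_stim")

# minimum of the per-dataset median UMI, used as a common scale factor
min_median_umi <- min(sapply(obj.list, function(x) median(x$nCount_RNA)))

# SCTransform v2 per dataset, sharing one scale factor
obj.list <- lapply(obj.list, function(x) {
  SCTransform(x, vst.flavor = "v2", scale_factor = min_median_umi)
})

# standard SCT-based integration with reciprocal PCA
features <- SelectIntegrationFeatures(obj.list, nfeatures = 3000)
obj.list <- PrepSCTIntegration(obj.list, anchor.features = features)
obj.list <- lapply(obj.list, RunPCA, features = features)
anchors  <- FindIntegrationAnchors(obj.list, normalization.method = "SCT",
                                   anchor.features = features, reduction = "rpca")
combined <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
```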

reneemoerkens commented 1 year ago

Thank you so much for your fast response @saketkc, that helps a lot. So to confirm, this is the workflow that you propose:

[Image: diagram of the proposed workflow]

And after this I can use the integrated assay for cell type annotation and the SCT assay to run DE analysis (FindMarkers)?
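In code, I imagine roughly the following (a rough sketch; `combined` and the parameter choices are placeholders):

```r
# clustering and annotation on the integrated assay
DefaultAssay(combined) <- "integrated"
combined <- RunPCA(combined)
combined <- RunUMAP(combined, dims = 1:30)
combined <- FindNeighbors(combined, dims = 1:30)
combined <- FindClusters(combined, resolution = 0.5)
# annotate cell types on these clusters, then switch to the SCT assay for DE:
DefaultAssay(combined) <- "SCT"
```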

saketkc commented 1 year ago

Yes, this is what I suggested.

reneemoerkens commented 1 year ago

@saketkc, thank you again for the confirmation and for taking the time to help me out! I tried the suggested workflow, but unfortunately performing the integration based on cell_line + stimulation condition results in a clustering that no longer distinguishes well between these stimulations (because my stimulation condition actually has a large impact on cell type composition, as it is a sort of differentiation medium). I am now inclined to go for workflow 4, in which I perform the integration based only on cell_line to improve cell type annotation, and subsequently run SCTransform v2 on the complete dataset. Does running SCTransform v2 on separate versus merged datasets make a big difference in its ability to correct for differences in sequencing depth (which is my main concern with the different lanes)?

saketkc commented 1 year ago

Since this is all the same technology, I wouldn't expect it to make a big difference. PrepSCTFindMarkers is useful to account for sequencing depth differences if you are using the merge strategy. So you can run SCT v2 independently on each cell line, perform integration, and then run PrepSCTFindMarkers to adjust for the fact that the corrected counts calculated by SCT were based on different median UMIs in different samples (so we re-estimate the corrected counts using the minimum value of the median UMI across samples). Does this make sense?

In short, the integration choice is something that is very context dependent. Irrespective of which workflow you choose, you can always run PrepSCTFindMarkers before running DE to make sure your merged dataset has its corrected counts calculated based on one sequencing depth (which is the minimum of the median UMI across datasets).
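For concreteness, a minimal sketch of that adjustment step (the object name and the identity/metadata labels are placeholders, not from your data):

```r
# re-estimate the SCT corrected counts so all samples share one sequencing depth
# (the minimum of the median UMI across samples)
combined <- PrepSCTFindMarkers(combined, assay = "SCT")

# DE on the adjusted SCT assay ("IFN", "CTRL", "stimulation" are placeholder labels)
de <- FindMarkers(combined, assay = "SCT", ident.1 = "IFN", ident.2 = "CTRL",
                  group.by = "stimulation")
```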

reneemoerkens commented 1 year ago

@saketkc, I understand, and indeed, after testing different workflows I see that the differences are very small. With the different workflows I get a very similar list and number of differentially expressed genes per cluster, with similar log2FC and p_val_adj.

In the end I will do my reciprocal PCA integration based on 'cell line' to improve annotation, and then perform SCT v2 on the complete dataset (which is a bit more convenient for my follow-up analysis). Apparently, in my dataset SCTransform v2 does a good job correcting for the sequencing depth differences between the lanes, without splitting/independent SCTransform v2/PrepSCTFindMarkers. Could you briefly explain the method integrated within the SCTransform v2 function to account for sequencing depth differences?

Thank you for all your help, very much appreciated!

saketkc commented 1 year ago

I will refer you to the v2 paper for the method details.

reneemoerkens commented 1 year ago

Thank you!

reneemoerkens commented 1 year ago

Hi @saketkc, sorry to bother you again, one final question: I am considering performing the following on the complete dataset: SCTransform(Seurat_object, vst.flavor = "v2", scale.factor = 21672.5), in which 21672.5 is the lowest median UMI when comparing the dataset split by lane, cell_line, or condition. Is it correct to use scale.factor when performing SCTransform v2 on the complete dataset (instead of split datasets) to correct for sequencing depth differences? Or is it better to leave it out?

Also, when using the scale.factor argument instead of PrepSCTFindMarkers, I don't need to pass recorrect_umi=FALSE when running FindMarkers() on a subset of the data. What is the reason you have to use recorrect_umi=FALSE (after PrepSCTFindMarkers) when analyzing a subset? In most subset comparisons the difference in sequencing depth remains a factor of technical variation, right?

saketkc commented 1 year ago

Yes, you can pass a custom scale_factor (note it is scale_factor and not scale.factor). This uses a constant scale factor (instead of estimating the median UMI from the data, which might be different across the datasets you are integrating). Once you do this, you can pass recorrect_umi=FALSE when running FindMarkers.
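Roughly like this (a rough sketch; the cluster and metadata labels are hypothetical placeholders, and 21672.5 is the value you computed):

```r
# SCTransform v2 on the complete merged object with a fixed scale factor
# (note scale_factor, not scale.factor)
Seurat_object <- SCTransform(Seurat_object, vst.flavor = "v2", scale_factor = 21672.5)

# corrected counts already use one common depth, so re-correction can be
# skipped when testing on a subset ("epithelial", "IFN", "CTRL", "stimulation"
# are hypothetical cluster and metadata labels)
subset_obj <- subset(Seurat_object, idents = "epithelial")
de <- FindMarkers(subset_obj, ident.1 = "IFN", ident.2 = "CTRL",
                  group.by = "stimulation", recorrect_umi = FALSE)
```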

reneemoerkens commented 1 year ago

Clear, thank you!