SCTransform on multiple batches

Nusob888 commented 4 years ago

Hi Christoph, Thanks for all your work.

I wanted to get your thoughts on a potential issue with per-batch sctransform that I haven't seen described elsewhere.

There has been lots of debate over the usage of sctransform on multiple samples/batches:

The Seurat vignettes suggest performing it on a per-batch basis before integration.
I noted your response to issue #32, where you suggest merging samples within a batch, which made sense to me.

However, I would really appreciate your thoughts on a phenomenon that I have observed and that is touched upon in the recent Theis group preprint on integration benchmarking (https://www.biorxiv.org/content/10.1101/2020.05.22.111161v1).

I have observed that if you perform sctransform separately on multiple batches with different cell-type compositions (e.g. HEK cells in one vs PBMCs in another), both Seurat v3 integration and Harmony fail to distinguish between the cell types and over-integrate the two biologically distinct populations.

After benchmarking a few approaches, I found that this was corrected either by using the traditional Seurat normalisation or by performing sctransform on a pre-merged count matrix of HEK cells and PBMCs and then performing Harmony integration; in these scenarios, the cells separate very well (images attached).

If I understand the approach sctransform takes, it learns and corrects for sequencing depth at a per-cell level, separately for each input matrix. Therefore, if the samples to be integrated are sufficiently heterogeneous, each model is potentially learned on a different set of highly abundant genes.

So my questions are: 1) In this scenario, would the separately fitted models become too different from one another for integration to retain sample-specific cell types?

2) Is performing sctransform on a merged matrix of two different batches (with potentially different library sizes) a valid approach?

In this approach, I would anticipate that sctransform would essentially treat the differing batch library sizes in the same way it would treat a cell that had been sequenced at a lower depth than another within the same sample.
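To make that expectation concrete, here is a toy sketch in Python. It is not sctransform's actual model (which fits a regularized negative binomial regression); it assumes a much simpler Poisson model in which a gene's expected count is the cell's depth times the gene's pooled share of all UMIs. The function name, depths, and counts are all made up for illustration.

```python
from math import sqrt

def pearson_residuals(counts, depths):
    """Poisson Pearson residuals for one gene, with depth as the only covariate.

    Assumption: expected count per cell = depth * pooled_fraction, where
    pooled_fraction is the gene's share of all UMIs across the whole input.
    This is a simplification, not sctransform's regularized NB regression.
    """
    fraction = sum(counts) / sum(depths)
    return [(x - d * fraction) / sqrt(d * fraction) for x, d in zip(counts, depths)]

# Hypothetical merged matrix: one gene, four cells.
# Cells 0-1 are from a shallow batch, cells 2-3 from a batch sequenced 4x deeper.
depths = [1000, 1000, 4000, 4000]

# Same 1% share of UMIs in every cell, so the batches differ only in depth.
counts_same = [10, 10, 40, 40]
res_same = pearson_residuals(counts_same, depths)

# The deep batch now genuinely expresses the gene at a higher share (2%).
counts_shift = [10, 10, 80, 80]
res_shift = pearson_residuals(counts_shift, depths)

print("depth-only difference:", [round(r, 3) for r in res_same])
print("real expression shift:", [round(r, 3) for r in res_shift])
```

Under these assumptions, the depth-only case gives residuals of essentially zero for both batches, while the genuine expression shift survives as negative residuals in the shallow batch and positive ones in the deep batch. Whether the real model behaves this cleanly across batches is exactly the question.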

Sorry for the long-winded nature of the post. Your thoughts would be greatly appreciated. Many thanks in advance.

integration example.pdf

ChristophH commented 4 years ago

Hi, when you run sctransform normalization you are standardizing the expression of each gene in each cell relative to all the other cells in the input matrix. That means that if you sctransform-normalize HEK and PBMC separately, you lose the baseline differences between them (similar to a gene-wise scaling before merging). The approach of first normalizing each sample (matrix) is only advisable if your samples have roughly the same cell-type compositions and you want to remove batch effects that are characterized by simple shifts in mean expression.
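A toy Python sketch of the "gene-wise scaling before merging" analogy may help. It uses a plain z-score as a stand-in for sctransform (an assumption made purely for illustration; the real method computes regression residuals), and the expression values are invented.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical log-expression of one gene: low in HEK cells, high in PBMCs.
hek = [1.0, 2.0, 3.0]
pbmc = [11.0, 12.0, 13.0]

# Per-batch scaling: each batch is standardized against its own cells only.
hek_sep, pbmc_sep = zscore(hek), zscore(pbmc)

# Merged scaling: all cells are standardized together.
merged = zscore(hek + pbmc)
hek_mrg, pbmc_mrg = merged[:len(hek)], merged[len(hek):]

# Difference between the cell types' mean scaled expression.
gap_separate = mean(pbmc_sep) - mean(hek_sep)
gap_merged = mean(pbmc_mrg) - mean(hek_mrg)

print(f"gap after per-batch scaling: {gap_separate:.3f}")
print(f"gap after merged scaling:    {gap_merged:.3f}")
```

With per-batch scaling both cell types end up centred on zero, so the baseline gap disappears. Scaling the merged data keeps the two groups roughly two standard deviations apart.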

Nusob888 commented 4 years ago

Thank you Christoph for the reply. This explains the observation nicely.

In that case, do you have a recommendation for what to do when trying to integrate batches of potentially varying cell-type composition?

In this scenario, batch effects cannot be explained by simple shifts in mean expression.

Would my second approach of running sctransform on a merged matrix of all batches be appropriate in this case, perhaps even with batch assignment included as a latent variable to regress out?

satijalab / sctransform

SCTransform on multiple batches #55