satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0

SCTransform on multiple batches #55

Open Nusob888 opened 4 years ago

Nusob888 commented 4 years ago

Hi Christoph, Thanks for all your work.

I wanted to get your thoughts on a potential issue of per batch sctransform that I haven't seen described elsewhere.

There has been lots of debate over the usage of sctransform on multiple samples/batches:

However, I would really appreciate your thoughts on a phenomenon I have observed, which is touched upon in the recent Theis group preprint on integration benchmarking (https://www.biorxiv.org/content/10.1101/2020.05.22.111161v1).

I have observed that if you run sctransform separately on multiple batches with different cell type compositions (e.g. HEK cells in one and PBMCs in another), both Seurat v3 integration and Harmony fail to distinguish between the cell types and over-integrate the two biologically distinct populations.

After benchmarking a few approaches, I found that this was corrected either by using the traditional Seurat normalisation or by running sctransform on a pre-merged count matrix of HEK cells and PBMCs and then performing Harmony integration. In both scenarios the cells separate very well (images attached).

If I understand the approach sctransform takes, it learns and corrects for sequencing depth at a per-cell level, separately for each input matrix. Therefore, if the samples to be integrated are sufficiently heterogeneous, each model is potentially learned on a different set of highly abundant genes.

So my questions are: 1) In this scenario, would the separately fitted models differ too much for integration to retain sample-specific cell types?

2) Is performing sctransform on a merged matrix of two different batches (with potentially different library sizes) a valid approach?

Sorry for the long-winded nature of the post. Your thoughts would be greatly appreciated. Many thanks in advance.

integration example.pdf

ChristophH commented 4 years ago

Hi, when you run sctransform normalization you are standardizing the expression of each gene in each cell relative to all the other cells in the input matrix. That means that if you sctransform-normalize HEK and PBMC separately, you lose the baseline differences between them (similar to a gene-wise scaling before merging). The approach of first normalizing each sample (matrix) is only advisable if your samples have roughly the same cell type compositions and you want to remove batch effects that are characterized by simple shifts in mean expression.
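
The effect described above can be illustrated with a toy example. This is only a sketch: it uses plain gene-wise z-scoring as a stand-in for sctransform's residuals, and the expression values and the names `hek`, `pbmc`, and `zscore` are invented for illustration. It compares the gap between the two cell types after standardizing each sample on its own versus standardizing the pooled cells.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize values relative to the other values in the same input."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical log-expression of one gene: low in HEK cells, high in PBMCs.
hek = [1.0, 1.2, 0.8, 1.1, 0.9]
pbmc = [5.0, 5.3, 4.7, 5.1, 4.9]

# Separate standardization: each sample is centered on its own mean,
# so both samples end up with mean zero.
hek_sep, pbmc_sep = zscore(hek), zscore(pbmc)
gap_separate = mean(pbmc_sep) - mean(hek_sep)

# Pooled standardization: all cells share one reference distribution,
# so the difference between the two cell types is kept.
pooled = zscore(hek + pbmc)
gap_pooled = mean(pooled[len(hek):]) - mean(pooled[:len(hek)])

print(f"gap after separate scaling: {abs(gap_separate):.3f}")
print(f"gap after pooled scaling:   {gap_pooled:.3f}")
```

With separate scaling the gap between cell types collapses to zero up to floating-point error, while pooled scaling keeps a clear difference. A downstream integration method that only sees the separately scaled values has no baseline shift left to tell the two populations apart for this gene.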

Nusob888 commented 4 years ago

Thank you Christoph for the reply. This explains the observation nicely.

In that case, do you have a recommendation of what to do when trying to integrate batches of potentially varying cell type composition?

Would my second approach of running sctransform on a merged matrix of all batches be appropriate in this case? Perhaps even specifying the batch assignment as a latent variable to regress out?
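
One consideration for the batch-as-covariate idea can be shown with a toy example. This is a sketch under strong assumptions: it uses ordinary least squares on made-up log-expression values for a single gene, not sctransform's negative binomial model, and the helper names `regress_out_binary` and `type_gap` are hypothetical. It compares regressing out batch when batch coincides with cell type against the case where both batches contain both cell types.

```python
from statistics import mean

def regress_out_binary(y, x):
    """Return OLS residuals of y on an intercept and a 0/1 covariate x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def type_gap(resid, cell_type):
    """Mean residual of type-1 cells minus mean residual of type-0 cells."""
    ones = [r for r, t in zip(resid, cell_type) if t == 1]
    zeros = [r for r, t in zip(resid, cell_type) if t == 0]
    return mean(ones) - mean(zeros)

# Hypothetical log-expression of one gene; cell type 1 expresses it more.
expr = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.7, 5.1]
cell_type = [0, 0, 0, 0, 1, 1, 1, 1]

# Case A: batch coincides with cell type (e.g. HEK-only vs PBMC-only).
batch_confounded = [0, 0, 0, 0, 1, 1, 1, 1]
# Case B: both batches contain both cell types.
batch_mixed = [0, 0, 1, 1, 0, 0, 1, 1]

gap_confounded = type_gap(regress_out_binary(expr, batch_confounded), cell_type)
gap_mixed = type_gap(regress_out_binary(expr, batch_mixed), cell_type)

print(f"cell type gap, batch confounded with type: {abs(gap_confounded):.3f}")
print(f"cell type gap, batches of mixed type:      {gap_mixed:.3f}")
```

In this linear toy model, regressing out a batch label that coincides with cell type removes the cell type difference along with the batch effect, whereas the difference survives when each batch contains both types. Whether the same holds for a given sctransform configuration on real data would need to be checked separately.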