tanaylab / metacell

Metacell - Single-cell mRNA Analysis
https://tanaylab.github.io/metacell

How to deal with data with clear batch effect? #56

Closed hurleyLi closed 3 years ago

hurleyLi commented 3 years ago

Hi, I'm wondering whether MetaCell can be applied to datasets with known, clear batch effects between samples. Normally I would use something like ComBat to adjust for the batch effect, but since MetaCell takes raw counts for feature selection and downstream analyses, does the algorithm account for variance between batches? If not, can I feed MetaCell a normalized matrix after running standard normalization + ComBat?

Really appreciate it! Hurley

akhiadber commented 3 years ago

Hi,

The handling of batch effects is very problem-dependent, unlike sampling noise. Batch effects can arise from different sources, so I'll explain how we might handle two example sources. Note that this is not comprehensive, and that it includes manual steps that are not part of the default pipeline.

Source 1: Technical differences between experiments, such as taking one hour to handle batch A and five hours for batch B. This batch effect might manifest as a stronger stress signature in batch B. We handle such batch effects by inferring gene modules (similar to the filtering of feature genes in https://tanaylab.github.io/metacell/articles/b-geneset_by_anchor_pbmc8k.html) and analyzing the batch differences across the gene modules. We then exclude the "batchy" gene modules from the feature genes, or blacklist these genes (as we do in the normal pipeline for mitochondrial genes).

Source 2: Ambient RNA at X percent. In this case we would probably do the same as for Source 1, but for the highest-mean, "not interesting" genes from each batch, which removes some of the effect but of course not all. We might also calculate metacells for each batch separately (depending on how many samples you have from each batch), then infer and correct the batch effect per cell state.

In either case, this involves careful supervision and is not an out-of-the-box solution.
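To make the Source 1 idea concrete, here is a minimal sketch in plain NumPy. It is not MetaCell code and does not use the package's API: the function name `batchy_modules`, the `min_log2_fc` threshold, and the toy data are all assumptions made for illustration. It takes gene module assignments as given (for example, from clustering correlated genes), compares the fraction of UMIs each module receives in each batch, and flags modules whose fraction differs strongly between batches.

```python
import numpy as np

def batchy_modules(counts, batches, modules, min_log2_fc=1.0, pseudo=1e-5):
    """Flag gene modules whose pooled expression differs strongly between batches.

    counts  : (cells x genes) raw UMI counts
    batches : per-cell batch labels
    modules : per-gene module labels (hypothetical; assumed to be computed elsewhere)
    Returns {module: log2 spread across batches} for modules above min_log2_fc.
    """
    counts = np.asarray(counts, dtype=float)
    batches = np.asarray(batches)
    modules = np.asarray(modules)
    batch_ids = np.unique(batches)
    module_ids = np.unique(modules)

    # Fraction of each batch's total UMIs that falls in each module.
    frac = np.zeros((len(batch_ids), len(module_ids)))
    for i, b in enumerate(batch_ids):
        gene_totals = counts[batches == b].sum(axis=0)
        total = gene_totals.sum()
        for j, m in enumerate(module_ids):
            frac[i, j] = gene_totals[modules == m].sum() / total

    # Largest log2 difference between any two batches, per module.
    log_frac = np.log2(frac + pseudo)
    spread = log_frac.max(axis=0) - log_frac.min(axis=0)
    return {m: float(s) for m, s in zip(module_ids, spread) if s >= min_log2_fc}

# Toy data: batch B has an elevated "stress" module.
counts = np.array([
    [10, 12, 1, 2, 5, 6],
    [11, 9, 2, 1, 6, 5],
    [10, 11, 9, 10, 5, 5],
    [9, 10, 11, 8, 6, 6],
])
batches = ["A", "A", "B", "B"]
modules = ["core", "core", "stress", "stress", "other", "other"]

flagged = batchy_modules(counts, batches, modules)
# Genes in flagged modules would be excluded from the feature gene candidates.
excluded = np.isin(modules, list(flagged))
print(flagged, excluded)
```

With the toy data only the stress module crosses the threshold, so its two genes are the ones masked out. In practice the threshold and the decision to drop a module still need manual review, since a module can differ between batches for biological reasons.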

If the batch effect is known and you have your favorite method of handling it (such as ComBat), then you could input the normalized matrix, but you might have to adjust some of the default parameters to get optimal results (like the thresholds on feature genes). This should work, but has not been tested extensively. You're welcome to provide more details, to receive the most relevant suggestions.

hurleyLi commented 3 years ago

Thanks so much for the suggestion! Using normalized input seems easier, but I really like the idea of removing batchy genes by gene module, as my situation is closer to Source 1 and I think the data will probably be cleaner this way. I will give it a try and report back what I find. Thanks!

zh-Bian commented 2 years ago

Hi, how did the Source 1 approach work for you? Was it able to adjust for the batch effect? And what about the gene modules: were they something like proliferation genes or HSP genes? Also, can this approach adjust for batch differences in sequencing depth, such as 60% versus 80% saturation?