saezlab / decoupler-py

Python package to perform enrichment analysis from omics data.
https://decoupler-py.readthedocs.io/
GNU General Public License v3.0

Input for bulk analysis when dealing with batch effect #68

Closed merlevede closed 1 year ago

merlevede commented 1 year ago

Hello,

I am using decoupleR / decoupleR-py to quantify diverse signatures in bulk expression profiles of patients.
I previously asked which input format should be provided to decoupleR when working with the expression matrix (rather than DEA results), and the answer was that both normalized and log-normalized counts are fine.

Now I have to deal with a batch effect. I plan to use DESeq2 to normalise the counts and apply vst, and then remove the batch effect using limma::removeBatchEffect. So the expression data will be on the vst scale.

Is it fine as input? Does it break some statistical assumptions in some of the methods? If it is not possible to use vst, what do you suggest to perform quantification on a batch-corrected dataset?

Thank you for your help,
Jane

PauBadiaM commented 1 year ago

Hi @merlevede

You can add the batch as a covariate to the DESeq2 model; this way you regress the effect out. You can follow the same vignette as here, but add your batch column name to the design_factors argument of DeseqDataSet, for example: design_factors=['condition', 'batch']. Hope this is helpful!

merlevede commented 1 year ago

Thanks. I actually provided only the batch information in my model: design= ~ batch. I have several clinical variables (histology, stage, gender, smoking status, ...) which I did not include, since I do not want to perform DEA. I only need normalized counts to perform pathway quantification.

I thought that the variables included in the model definition were only useful for DEA, and that they did not lead to big changes in the normalized counts. Do you mind if I ask when removeBatchEffect should be used, then?

PauBadiaM commented 1 year ago

Hi @merlevede

It really depends on what you want to do downstream. Based on this seminal paper (https://doi.org/10.1093/biostatistics/kxv027), I would not use batch-correction tools directly on the gene expression data. I would rather regress out the batch effect while modeling the statistical test you want to perform. If that is not possible, then I would use batch-removal tools.
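To illustrate what "regressing the batch while modeling the test" means, here is a minimal sketch in plain Python. It is not DESeq2 (which fits a negative binomial model on counts); it uses ordinary least squares on hypothetical, noise-free log-scale values for a single gene, and every name and number in it is an assumption made for illustration. The design is deliberately unbalanced, so condition and batch are partly confounded.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Zero out this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(design, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def group_mean(values, labels, wanted):
    """Mean of the values whose label equals `wanted`."""
    picked = [v for v, lab in zip(values, labels) if lab == wanted]
    return sum(picked) / len(picked)


# Hypothetical toy data: 8 samples, unbalanced across condition and batch.
condition = [0, 0, 0, 0, 1, 1, 1, 1]
batch = [0, 0, 0, 1, 0, 1, 1, 1]
# True condition effect is 2.0 and true batch shift is 3.0 (assumed values).
expression = [5.0 + 2.0 * c + 3.0 * b for c, b in zip(condition, batch)]

# Ignoring the batch: the difference in condition means absorbs batch shift.
naive_effect = (group_mean(expression, condition, 1)
                - group_mean(expression, condition, 0))

# Modeling the batch: columns are intercept, condition, batch.
design = [[1.0, c, b] for c, b in zip(condition, batch)]
intercept, condition_effect, batch_effect = fit_ols(design, expression)

print(f"naive condition effect:    {naive_effect:.2f}")
print(f"adjusted condition effect: {condition_effect:.2f}")
print(f"estimated batch effect:    {batch_effect:.2f}")
```

In this toy data the naive difference in means overstates the condition effect, while the model that includes the batch column separates the two contributions. The expression values themselves are never modified, which is the point of handling the batch inside the model.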

merlevede commented 1 year ago

Thanks @PauBadiaM. I think I will try both, because the batch is still visible on a UMAP of the expression data when I only provide the batch in the model. Could you please tell me whether vst data would be correct input for decoupleR after I apply removeBatchEffect to it? Thanks in advance.

PauBadiaM commented 1 year ago

Hi @merlevede, yes, that would also be correct.
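For readers who want a feel for what correcting the expression matrix directly does, here is a rough sketch in plain Python. It is a simplified stand-in, not limma::removeBatchEffect: limma fits a linear model and can protect covariates of interest through its design argument, whereas this sketch only centers each batch on the overall mean of one gene. The function name and the vst-like values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Remove per-batch mean shifts from one gene's log-scale values.

    Each value has its batch mean subtracted and the overall mean added
    back, so the gene keeps its average level but the batches no longer
    differ in mean.
    """
    overall = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]


# Hypothetical vst-like values for one gene across six samples.
sample_batches = ["A", "A", "A", "B", "B", "B"]
gene_values = [8.0, 8.5, 9.0, 10.0, 10.5, 11.0]

corrected = center_batches(gene_values, sample_batches)

for batch_label in ("A", "B"):
    before = mean(v for v, b in zip(gene_values, sample_batches)
                  if b == batch_label)
    after = mean(v for v, b in zip(corrected, sample_batches)
                 if b == batch_label)
    print(f"batch {batch_label}: mean before={before:.2f}, after={after:.2f}")
```

Applied gene by gene, this yields a corrected matrix on the same log-like scale as the input. Note that plain centering also removes any biological difference that happens to align with the batches, which is one reason a model-aware tool is preferable when the design is unbalanced.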