saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0

Input for bulk analysis: taking or not the log of the normalized counts? #89

Closed: merlevede closed this issue 11 months ago

merlevede commented 1 year ago

Hello,

I have been using decoupleR / decoupleR-py for a few weeks to quantify diverse signatures in bulk expression profiles of patients with lung cancer. Thank you for your tool; it is very convenient to be able to test a set of methods in a unified framework!

As input, I provide the gene expression matrix with the log-normalized counts, but I am wondering whether this is correct. Maybe it should instead be just the normalized counts. Indeed, the doc mentions that the input to decoupleR should be the normalized counts or the results of DEA.

Thus, I ran decoupleR-py the same way with the normalized counts. I observed differences in the results depending on which input was provided, but these differences are limited for most of the tested databases, like MSigDB c8 / hallmarks or progeny. Nevertheless, the results (the highest scores) obtained with the LIANA and TF gene sets are very different (with the methods ulm, wsum and wmean).

Could you please confirm whether or not we should take the log of the normalized counts? Also, do you have an idea why the differences due to the input matrices are bigger for some databases (like LIANA and Collectri) than for others (like MSigDB)? Maybe this is specific to my dataset...

Thank you for your help.

PauBadiaM commented 1 year ago

Hi @merlevede

You can use either the log-transformed or the non-log-transformed normalized counts; as long as they are normalized, it's fine. The differences are due to the sources (the "gene sets") that are included in each database. For method selection, I would recommend sticking to ulm, since we have seen it's the most reliable method in our benchmarks, and it also provides the direction of enrichment plus its significance in a single score. Hope this is helpful!
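
To make the "direction plus significance in a single score" point concrete, here is a minimal sketch, assuming ulm amounts to a univariate linear model per sample and source: expression is regressed on the source's weights (non-targets get weight 0) and the t-statistic of the slope is the score. The function name `ulm_score` and the toy data are hypothetical and only for illustration; this is not the decoupleR or decoupleR-py API.

```python
import math


def ulm_score(expression, weights):
    """t-statistic of the slope when regressing expression on weights.

    expression: dict gene -> normalized value for one sample
    weights:    dict gene -> weight for one source; genes missing
                from this dict are treated as non-targets (weight 0)
    """
    genes = sorted(expression)
    n = len(genes)
    x = [weights.get(g, 0.0) for g in genes]
    y = [expression[g] for g in genes]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Pearson correlation, then the usual t-statistic with n - 2 degrees
    # of freedom (equal to the slope's t-value in simple linear regression).
    # This sketch does not guard against sxx == 0, syy == 0 or |r| == 1.
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy sample: genes A and B are high, the rest are low (made-up values).
sample = {"A": 8.0, "B": 7.5, "C": 1.0, "D": 1.2, "E": 0.9, "F": 1.1}
activator = {"A": 1.0, "B": 1.0}    # hypothetical source with positive weights
repressor = {"A": -1.0, "B": -1.0}  # same targets, negative weights

pos = ulm_score(sample, activator)
neg = ulm_score(sample, repressor)
print(pos, neg)  # pos is positive; neg has the same magnitude, opposite sign
```

The sign of the score gives the direction of enrichment and its magnitude reflects how strongly the targets stand out from the other genes, which is why a single number carries both pieces of information.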

merlevede commented 1 year ago

Thanks a lot for your answer @PauBadiaM! Maybe my point about the differences observed in the results depending on the provided input was unclear. I mean that, when I do the analysis with a specific method and a specific database / gene sets, the results vary depending on the provided input (normalized or log-normalized counts). The results vary "a lot" with the LIANA and TF databases, tested independently. This holds true for specific methods, including ulm. The results do not seem to vary (or only to a very limited extent) with MSigDB c8 / hallmarks or progeny, whatever the method.

Thanks for your input on ulm!

PauBadiaM commented 11 months ago

Hi @merlevede

Since the LIANA resource works with pairs of genes (ligand and receptor), its results are more sensitive to the preprocessing you use than those of resources that contain bigger gene sets, such as MSigDB (dozens or hundreds of genes per gene set). You can increase the minsize parameter to filter out gene sets that have a low number of targets matching your input mat. This should stabilize the results, but you will also lose gene sets, so there is a tradeoff. For the LIANA resource, unfortunately, you need to keep minsize < 2, because any higher value would remove all interactions.
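
As a rough illustration of that tradeoff, here is a minimal sketch of a size filter, assuming a gene set is kept when at least `min_size` of its targets are present in the input. The helper `filter_by_min_size`, the gene names, and the inclusive cutoff are all assumptions made for this example; the library's own filtering may differ in the details.

```python
def filter_by_min_size(net, measured_genes, min_size):
    """Keep only gene sets with at least `min_size` targets found in the input.

    net:            dict source -> list of target genes
    measured_genes: iterable of genes present in the input matrix
    min_size:       minimum number of matching targets (inclusive here,
                    which is an assumption of this sketch)
    """
    measured = set(measured_genes)
    kept = {}
    for source, targets in net.items():
        matched = [g for g in targets if g in measured]
        if len(matched) >= min_size:
            kept[source] = matched
    return kept


# Made-up resources: one larger gene set and one ligand-receptor pair.
net = {
    "LARGE_SET": ["G1", "G2", "G3", "G4", "G5"],
    "LIGAND^RECEPTOR": ["L1", "R1"],
}
measured_genes = ["G1", "G2", "G3", "G5", "L1", "R1"]

lenient = filter_by_min_size(net, measured_genes, min_size=1)
strict = filter_by_min_size(net, measured_genes, min_size=3)
print(sorted(lenient))  # both sources survive
print(sorted(strict))   # only the larger set survives; the pair is dropped
```

With a stricter cutoff the surviving scores rest on more genes, but every two-gene interaction disappears, which is the loss mentioned above for pair-based resources.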

Hope this is helpful!