saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0

High TF activity with few significant genes #116

Closed by laurie-tonon 2 months ago

laurie-tonon commented 5 months ago

Hi,

We use decoupleR with CollecTRI to quantify transcription factor activity in our bulk RNA-seq data. We observe very strange results: regulons show strong activity with high statistical significance, but when we look at the volcano plot of their genes, only a handful are significant. We use the DESeq2 stat value as input, and the consensus function.

Here is an example for a differential analysis:

image

We see that the MYC regulon is highly repressed in our analysis. But if we look at the volcano plot of its genes:

image

Only 4 genes are significant.

Same thing if we look at SP1:

image

As a result we are not very confident in the results obtained.

Could you explain how such enrichment scores can be reached with so few significant genes?

Thanks a lot

PauBadiaM commented 5 months ago

Hi @laurie-tonon,

To assess this correctly, you should plot the stat values instead of the log2FC, since the stat values are what is ultimately used to compute the enrichment score. Also, genes that are not significantly changing are still used in the score calculation; a high score just means that the regulon's target genes shift in a coordinated positive or negative direction. If significance is important in your application, you could filter genes by it, but I would advise against this, since it shrinks the background distribution of genes and the results can become noisy. Hope this is helpful!
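To make the point about coordinated shifts concrete, here is a minimal sketch in plain Python. It is not decoupleR's code: it only mimics a ULM-style score (the t-value of the slope when gene-level stats are regressed on regulon weights, which is the idea behind one of the methods decoupleR offers), and all numbers are made-up toy data. The names `ulm_t_value`, `background` and `targets` are hypothetical.

```python
import math

def ulm_t_value(stats, weights):
    """t-value of the slope when regressing gene-level stats on regulon weights."""
    n = len(stats)
    mean_x = sum(weights) / n
    mean_y = sum(stats) / n
    sxx = sum((x - mean_x) ** 2 for x in weights)
    syy = sum((y - mean_y) ** 2 for y in stats)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, stats))
    r = sxy / math.sqrt(sxx * syy)               # Pearson correlation
    return r * math.sqrt((n - 2) / (1 - r * r))  # equivalent to the slope's t-value

# Toy data (assumption, not from the thread): 1800 background genes whose
# stats are spread evenly around 0, and 200 target genes whose stats are
# all mildly negative but never beyond the nominal |stat| > 1.96 cutoff.
n_background, n_targets = 1800, 200
background = [-1.5 + 3.0 * i / (n_background - 1) for i in range(n_background)]
targets = [-1.9 + 1.8 * i / (n_targets - 1) for i in range(n_targets)]

stats = background + targets
weights = [0.0] * n_background + [1.0] * n_targets  # 1 = target of the TF

n_significant_targets = sum(abs(s) > 1.96 for s in targets)
score = ulm_t_value(stats, weights)

print("individually significant targets:", n_significant_targets)
print("regulon-level t-value:", round(score, 2))
```

In this toy setup no single target gene passes the cutoff, yet the regulon-level t-value is strongly negative, because the score tests whether the targets as a group sit away from the background, not whether each gene is significant on its own.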

laurie-tonon commented 4 months ago

Hi @PauBadiaM,

Thanks for your help. You are right: we should plot the stat values, since these are what decoupleR uses. But that won't change our conclusion that we don't trust the results, since many regulons are reported as significantly altered while very few genes are. We also tried another metric, -log10(p-value), but the results were just as inconsistent. Our conclusion is that transcription factor activity can't be analysed this way when there are too few differentially expressed genes; we can only calculate a score per sample and compare the distributions between our conditions. Do you agree with this conclusion, or is there something else we haven't tried?

Thanks a lot

PauBadiaM commented 4 months ago

Hi @laurie-tonon,

Yes, computing scores per sample and then comparing their distributions between conditions is also a valid strategy. However, I would also expect few hits with that approach if so few genes in the results are significantly differentially expressed.
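For reference, the per-sample strategy could be sketched roughly as follows. This is an illustrative assumption, not decoupleR's implementation: the score is a simple weighted mean of target-gene expression, the comparison is a Welch t-statistic, and the regulon, gene names and expression values are all hypothetical toy data.

```python
import math
from statistics import mean, variance

def sample_score(expression, regulon):
    """Weighted mean of target-gene expression for one sample."""
    return sum(expression[gene] * w for gene, w in regulon.items()) / len(regulon)

def welch_t(a, b):
    """Welch t-statistic for the difference in means between two groups."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical regulon: +1 for activated targets, -1 for repressed targets.
regulon = {"geneA": 1.0, "geneB": 1.0, "geneC": -1.0}

# Toy, made-up expression values (e.g. scaled per gene), one dict per sample.
control = [
    {"geneA": 0.2, "geneB": 0.1, "geneC": -0.1},
    {"geneA": 0.0, "geneB": 0.3, "geneC": 0.0},
    {"geneA": 0.1, "geneB": -0.1, "geneC": 0.2},
]
treated = [
    {"geneA": -0.4, "geneB": -0.3, "geneC": 0.5},
    {"geneA": -0.6, "geneB": -0.2, "geneC": 0.3},
    {"geneA": -0.3, "geneB": -0.5, "geneC": 0.4},
]

# One activity score per sample, then compare the two score distributions.
control_scores = [sample_score(s, regulon) for s in control]
treated_scores = [sample_score(s, regulon) for s in treated]
t_stat = welch_t(treated_scores, control_scores)

print("control scores:", [round(x, 3) for x in control_scores])
print("treated scores:", [round(x, 3) for x in treated_scores])
print("Welch t (treated vs control):", round(t_stat, 2))
```

With real data, the per-sample scores would come from running the chosen decoupleR method on the normalized expression matrix, and the number of samples per condition limits how much such a comparison can detect.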