saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0
176 stars 23 forks source link

rationale behind using limma for bulk RNAseq TF inference? #86

Closed erzakiev closed 1 year ago

erzakiev commented 1 year ago

In the transcription factor activity vignette the method of choice for calculating differentially expressed genes is limma.

head(deg)
#>             logFC          t      P.Value
#> RHBDL2  -1.823940 -12.810588 3.030276e-06
#> PLEKHH2 -1.568830 -10.794453 9.932046e-06
#> HEG1    -1.725806  -9.788112 1.939734e-05
#> CLU     -1.786200  -9.761618 1.975813e-05
#> FHL1     2.087082   8.950191 3.552199e-05
#> RBP4    -1.728960  -8.529074 4.904579e-05

And in the end we only use the t statistic for the TF inference.

  1. Why such a strange choice for bulk RNAseq (since limma was specifically designed for microarray data)
  2. Can we use DESeq2's stat for that purpose instead, like the one found in the results output of its analysis?

Example of a DESeq2 result:

#>head(results(txi_deseq_deseq))
log2 fold change (MLE): Group TN vs TP 
Wald test p-value: Group TN vs TP 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat    pvalue
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000000001   9.35079       0.158397  0.537063  0.294932  0.768046
ENSMUSG00000000003   0.00000             NA        NA        NA        NA
ENSMUSG00000000028   3.28258       0.287198  0.913961  0.314234  0.753343
ENSMUSG00000000037   1.05720      -0.787875  1.493053 -0.527694  0.597712
ENSMUSG00000000049   0.00000             NA        NA        NA        NA
ENSMUSG00000000056   2.10253       0.874624  1.117454  0.782693  0.433807
                        padj
                   <numeric>
ENSMUSG00000000001         1
ENSMUSG00000000003        NA
ENSMUSG00000000028         1
ENSMUSG00000000037         1
ENSMUSG00000000049        NA
ENSMUSG00000000056         1
ultimatex5 commented 1 year ago

Hi', same doubt but about results obtained with edgeR: i have logFC, logCPM, F, Pvalue and FDR as columns of DEGs analysis. Which one could be used instead of t?

PauBadiaM commented 1 year ago

Hi @erzakiev and @ultimatex5,

You can use the differential expression analysis framework that you prefer. decoupleR in the end requires a gene level statistic to perform enrichment analysis but it is agnostic of how it was generated. However, we do recommend to use statistics that include the direction of change and its significance, for example the t-value obtained for limma or Deseq2 (stat). The problem with edgeR is that it does not return such statistic but you could create your own by weighting the obtained logFC by pvalue with this formula: -log10(pvalue) * logFC.

I will update the vignettes to clarify this point, thanks for pointing this out. Hope this is helpful! Let me know if you have more questions.

erzakiev commented 1 year ago

Thank you, looks ok for me, @PauBadiaM !

@ultimatex5 I'll close the issue if you don't have any remaining Q regarding edgeR?

ultimatex5 commented 1 year ago

You can close. Thank you!