saezlab / dorothea

R package to access DoRothEA's regulons
https://saezlab.github.io/dorothea/
GNU General Public License v3.0

Input format #15

Closed by andreyurch 4 years ago

andreyurch commented 4 years ago

Dear developers,

What is the optimal input format for DoRothEA in bulk transcriptome analysis: FPKM, normalised counts, or log-normalised counts? Modern annotations contain up to 50,000 genes, and many of them are not expressed. Should I filter out lowly expressed genes before the analysis?

christianholland commented 4 years ago

Dear @andreyurch,

thanks for your interest in our package.

Please note that the dorothea package is an experimental data package whose main purpose is to provide the TF-target interaction database (regulons). For convenience, we also developed a wrapper for the statistical method viper. However, dorothea is not limited to viper; its regulons can be used with any statistical method that analyses gene sets. Hence, the required format of the gene expression matrix depends only on the underlying statistical method and not on dorothea's regulons.

If you would like to use our wrapper for viper, I would suggest using log-normalised counts (e.g. logCPM). Optionally, you can also scale the data gene-wise by setting viper's method argument to "scale". I also strongly recommend filtering out lowly expressed genes. This step should be performed regardless of whether your annotation covers only 10,000 or up to 50,000 genes.
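As a rough illustration of the preprocessing described above, here is a minimal sketch in plain Python. It is not code from dorothea or viper. The function names, the toy count matrix, and the thresholds (`min_cpm`, `min_samples`, `prior_count`) are all assumptions chosen for the example, and `log2(CPM + prior_count)` is a simplified logCPM rather than the exact formula of any particular package. The gene-wise z-scoring is meant only to show what scaling the data gene-wise means.

```python
import math


def filter_and_log_cpm(counts, min_cpm=1.0, min_samples=2, prior_count=1.0):
    """Drop lowly expressed genes, then return log2(CPM + prior_count).

    counts: dict mapping gene name -> list of raw counts (one per sample).
    A gene is kept if its CPM reaches min_cpm in at least min_samples samples.
    All thresholds are illustrative assumptions, not recommended defaults.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    # Library size = total counts per sample (computed before filtering).
    lib_sizes = [sum(counts[g][j] for g in genes) for j in range(n_samples)]

    kept = {}
    for g in genes:
        cpm = [counts[g][j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        if sum(c >= min_cpm for c in cpm) >= min_samples:
            kept[g] = [math.log2(c + prior_count) for c in cpm]
    return kept


def scale_gene_wise(log_expr):
    """Z-score each gene across samples (mean 0, sample standard deviation 1)."""
    scaled = {}
    for g, values in log_expr.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        sd = math.sqrt(var)
        # Genes with no variation across samples are set to 0.
        scaled[g] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return scaled


# Toy raw counts for three samples (hypothetical values).
raw_counts = {
    "TP53": [500, 600, 550],
    "MYC": [1500, 1300, 1450],
    "GENE_LOW": [0, 1, 0],
    "GAPDH": [998000, 998099, 998000],
}

log_expr = filter_and_log_cpm(raw_counts)
scaled = scale_gene_wise(log_expr)
print(sorted(log_expr))  # GENE_LOW is removed by the expression filter
```

In an actual analysis the same steps would normally be done with established R tooling before passing the matrix to the viper wrapper; the sketch only makes the order of operations concrete: filter lowly expressed genes, log-normalise, then optionally scale gene-wise.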

Best wishes, Christian