saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0

Are TPMs better than counts for the linear model? #100

Closed andreyurch closed 10 months ago

andreyurch commented 11 months ago

Dear developers, first I would like to thank you for the wonderful set of tools! Second, I have two questions:

  1. I am interested in bulk RNA TF activity inference. You suggest using normalised counts for the linear model fitting. Let's imagine a situation where we have 4 genes which are targets of TF X: A, B, C, D, with counts 10, 10, 20, 20 and modes of regulation -1, -1, 1, 1. With such parameters the linear model will show a strong activation signature (10, -1) (10, -1) (20, 1) (20, 1) for TF X. But what if the lengths of genes A, B, C, D are 1000, 1000, 2000, 2000 respectively? If we convert counts to TPMs/FPKMs, the expression of the genes becomes equal within the cell (e.g. 10, 10, 10, 10; see the short sketch after these questions), and the linear model will then no longer indicate any activation of TF X. There is a possibility that I do not understand something, but it seems that normalising expression by gene length should be really important for TF activity inference within a sample...

  2. Did you compare the performance of VIPER with the linear model fitting? Which method is better from your point of view?
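
To make the length-normalisation arithmetic in question 1 concrete, here is a minimal R sketch (toy numbers from the example above, purely hypothetical):

```r
# Toy counts and gene lengths for the four targets of TF X from question 1.
counts  <- c(A = 10, B = 10, C = 20, D = 20)
lengths <- c(A = 1000, B = 1000, C = 2000, D = 2000)

# TPM: divide counts by gene length in kilobases, then rescale to a fixed total.
rpk <- counts / (lengths / 1000)   # reads per kilobase: 10, 10, 10, 10
tpm <- rpk / sum(rpk) * 1e6        # all four genes end up with the same TPM

tpm
```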

Best regards, Andrey

PauBadiaM commented 11 months ago

Hi @andreyurch,

  1. For TF activity inference at the sample level we leave the normalization choice up to the user. In your example you are missing the background genes: the genes in your mat that are not targets of TF X (they get a weight of 0) are also considered in the linear model (see the first sketch after this list). So, if the background is on average still below 10 after normalization, you will still get a high activity. Alternatively, you could perform TF activity inference at the contrast level if you have well-defined conditions.
  2. Yes, we did, as shown in Fig. 1C of the manuscript. In our benchmark we observed that, in general, simple linear models (ulm and mlm) outperform other classic methods such as viper or gsea. That comparison used our previous benchmark dataset; recently we reran the pipeline with the KnockTF2 database, which contains more perturbation experiments, and saw the same pattern: [benchmark results figure]. If you are interested in the benchmarking pipeline, you can find more information here. To sum up, my recommendation would be to use ulm, since with mlm you can sometimes run into collinearity issues (see the second sketch below). Hope this is helpful!
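
For intuition on point 1, here is a minimal base-R sketch of the ulm idea (an illustration, not decoupleR's exact implementation, and the numbers are hypothetical): the whole expression vector of a sample is regressed on the TF's weight vector, with non-target (background) genes kept in the fit at weight 0, and the t-value of the slope is the activity score.

```r
# Base-R sketch of a ulm-style fit (hypothetical numbers, not decoupleR's code).
# Non-target genes stay in the regression with weight 0, so they anchor the
# baseline that the weighted targets are contrasted against.
ulm_score <- function(expr, regulon) {
  w <- setNames(rep(0, length(expr)), names(expr))  # background genes get weight 0
  w[regulon$target] <- regulon$mor                  # targets keep their mode of regulation
  fit <- summary(lm(expr ~ w))
  fit$coefficients["w", "t value"]                  # t-value of the slope = activity score
}

# One sample of normalized expression: four targets of TF X plus background genes.
expr <- c(A = 1, B = 2, C = 12, D = 10, bg1 = 3, bg2 = 4, bg3 = 2, bg4 = 3)
regulon_X <- data.frame(target = c("A", "B", "C", "D"), mor = c(-1, -1, 1, 1))

ulm_score(expr, regulon_X)  # positive: activated targets sit above the background, repressed ones below
```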
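
And here is a sketch of the recommended sample-level call with decoupleR's run_ulm (toy mat and net for illustration only; with real data, mat would be your normalized gene-by-sample matrix and net a curated collection, e.g. from get_dorothea() or get_collectri()):

```r
library(decoupleR)

# Toy normalized expression matrix: genes x samples (values are made up).
set.seed(1)
mat <- matrix(
  rnorm(6 * 3, mean = 5), nrow = 6,
  dimnames = list(c("A", "B", "C", "D", "E", "F"), c("s1", "s2", "s3"))
)

# Toy regulons for two TFs; with real data this would come from a curated resource.
net <- data.frame(
  source = c("X", "X", "X", "X", "Y", "Y", "Y", "Y"),
  target = c("A", "B", "C", "D", "C", "D", "E", "F"),
  mor    = c(-1, -1, 1, 1, 1, 1, 1, -1)
)

# ulm fits each TF separately, so shared targets (here C and D) cannot make the
# design collinear the way they can when mlm fits all TFs jointly.
# minsize is lowered only because these toy regulons have 4 targets each.
acts <- run_ulm(mat, net,
                .source = "source", .target = "target", .mor = "mor",
                minsize = 4)
head(acts)  # a tibble with one score (t-value) per TF and sample
```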