saezlab / decoupler-py

Python package to perform enrichment analysis from omics data.
https://decoupler-py.readthedocs.io/
GNU General Public License v3.0
154 stars 23 forks

which data to use to infer TF Activity #6

Closed wariobrega closed 2 years ago

wariobrega commented 2 years ago

hello everyone,

First of all, thanks for developing decoupleR, I am using it in my project and I am finding it extremely useful.

I have more of a theoretical question regarding the TF/Pathway activity inference with the methods that you wrapped and ported to python.

Which data would you recommend using for inferring TF activity? At the moment I am using log-normalized data of all the genes that are expressed in the cells (no highly-variable gene selection, in order to retain as many regulons as possible).

However, I don't know if this is the correct approach. Could you shed some light on this?

Thanks again, and keep up the excellent work!

Daniele

PauBadiaM commented 2 years ago

Hi @wariobrega

Thanks for checking out the package! Sorry for the late response, I was on vacation.

This is a fantastic question; just so you know, we don't have a perfect solution for it ourselves. The short answer is that it really depends, so here comes the long one:

In my opinion, if possible, the best way to estimate activities is at the contrast level. First you perform differential expression analysis between conditions, preferably at the cell-type pseudobulk level if you are working with single-cell data, to obtain statistics at the gene level (these can be logFCs, t-values or anything else). Using statistical estimates as input makes the activity prediction more robust in theory, and the fact that they range from negative to positive values allows methods like wsum to correctly estimate inhibiting sources. For example, if a repressor TF has target genes with very low log-normalized values (meaning they are repressed), it will get a low activity score when it should actually be highly active; this is not a problem for methods based on linear models such as ulm or mlm, though. The only downside is that you need enough replicates (samples) per group to run the differential expression analysis, I would say at least 3 for each.
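To illustrate why signed statistics help, here is a toy ulm-style calculation (a minimal sketch in plain NumPy, not decoupler's actual implementation; the function name and the toy data are made up): regressing gene-level contrast statistics on a TF's signed target weights lets a repressor with strongly down-regulated targets come out as active.

```python
import numpy as np

def ulm_activity(gene_stats, tf_weights):
    """Univariate linear model sketch: the t-value of the slope of
    gene_stats ~ tf_weights (both 1-D arrays over the same genes)
    is used as the activity score."""
    x = np.asarray(tf_weights, dtype=float)
    y = np.asarray(gene_stats, dtype=float)
    n = x.size
    xc = x - x.mean()
    slope = (xc @ (y - y.mean())) / (xc @ xc)
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    # Standard error of the slope, then its t-value
    se = np.sqrt((resid @ resid) / (n - 2)) / np.sqrt(xc @ xc)
    return slope / se

# Toy contrast: a repressor TF (negative weights) whose targets are
# strongly down-regulated gets a *positive* activity, as it should.
rng = np.random.default_rng(0)
stats = rng.normal(size=100)   # background gene-level t-values
weights = np.zeros(100)
weights[:10] = -1.0            # 10 repressed targets
stats[:10] -= 3.0              # those targets go down in the contrast
act = ulm_activity(stats, weights)  # positive: repressor is active
```

With log-normalized counts as input (non-negative values only), a sum-based score over those negative weights would instead point the wrong way, which is the wsum caveat described above.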

Inference of activity at the sample (or cell) level is also possible from the log-normalized counts, but then you might have to deal with noisy values (especially in single cell); I would still do it for exploratory purposes. Another problem is that, as explained above, methods like wsum will not work correctly for sources with negative edges. One solution is to scale the log-normalized counts (basically z-score them per gene) in order to obtain positive and negative values. This works nicely in bulk but not so much in single cell, since there are many dropouts and all the genes with zeros get assigned low negative values by default. To correct for this we tried scaling only the non-zero values in single cell, but the results were not that good. By the way, in case you are working with trajectories or cell fate in single cell, another alternative is to use RNA velocity vectors as input for activity inference; this is something we are currently exploring.
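The per-gene scaling mentioned above can be sketched like this (a hypothetical helper, not part of decoupler; note that in single cell all the dropout zeros would land at the low negative end of each gene's column, which is exactly the caveat described):

```python
import numpy as np

def zscore_genes(logX):
    """Scale each column (gene) of a sample-by-gene log-normalized
    matrix to mean 0 and standard deviation 1, so every gene gets
    both negative and positive values."""
    mu = logX.mean(axis=0)
    sd = logX.std(axis=0)
    sd[sd == 0] = 1.0  # avoid dividing by zero for constant genes
    return (logX - mu) / sd

# Toy 3-samples x 2-genes log-normalized matrix
logX = np.array([[0.0, 2.0],
                 [1.0, 4.0],
                 [2.0, 6.0]])
Z = zscore_genes(logX)  # each column now has mean 0 and sd 1
```

After scaling, sum-based methods like wsum can pick up both activating and repressing edges, at the cost of the dropout artifact in sparse single-cell matrices.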

Regarding the selection of a gene universe, be it the highly variable genes or any other set of interest, I personally would advise against it for activity inference: the more information available, the better. The only time I would do it is to speed up calculations when working with a huge atlas, but in the Python version scalability shouldn't be a problem.

To sum up, you can use any gene-level statistic you want as input for activity inference, but preferably one that yields both negative and positive values.

Hope this was helpful! Let me know if you have any more questions.

wariobrega commented 2 years ago

Dear @PauBadiaM ,

Thanks a lot for the nice reply, super helpful :)

I sent you another couple of questions in private that are more project-specific rather than general, so I close here the Issue :)

Thanks again,

Daniele