saezlab / dorothea-py

Dorothea package in Python
MIT License

Sanity checks and minor improvements #2

Closed. gokceneraslan closed this 3 years ago

gokceneraslan commented 3 years ago

A few sanity checks and minor improvements. I was particularly scared by this line:

r_genes, r_tfs, R = np.sort(regnet.index), regnet.columns, np.array(regnet)

since target gene names and regnet rows can easily go out of sync (e.g. if the regnet rows are not alphabetically sorted for some reason).
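To make the concern concrete, here is a minimal sketch using a made-up three-gene network. The gene names, weights, and variable names are illustrative assumptions, not taken from the package. Sorting only the index labels relabels the rows of the weight matrix, while sorting the DataFrame itself keeps labels and rows together.

```python
import numpy as np
import pandas as pd

# Toy regulatory network (hypothetical): rows are target genes, columns are TFs.
# The rows are deliberately NOT in alphabetical order.
regnet = pd.DataFrame(
    [[1.0, 0.0],    # GENE_C
     [0.0, -1.0],   # GENE_A
     [1.0, 1.0]],   # GENE_B
    index=["GENE_C", "GENE_A", "GENE_B"],
    columns=["TF1", "TF2"],
)

# Pattern from the quoted line: the labels are sorted, the matrix is not.
bad_genes, bad_R = np.sort(regnet.index), np.array(regnet)

# One safe alternative: sort the frame first, then take labels and values together.
regnet_sorted = regnet.sort_index()
good_genes, good_R = regnet_sorted.index.to_numpy(), regnet_sorted.to_numpy()

# Row 0 of bad_R still holds GENE_C's weights but is now labeled GENE_A.
mismatch = not np.array_equal(bad_R[0], regnet.loc[bad_genes[0]].to_numpy())

# After sort_index, every row matches the gene it is labeled with.
aligned = all(
    np.array_equal(good_R[i], regnet.loc[gene].to_numpy())
    for i, gene in enumerate(good_genes)
)
```

If the rows happen to be sorted already, both variants give the same result, which is why this kind of bug can go unnoticed for a long time.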

I also turned off use_raw by default, since most of the time it's misleading and things might go unnoticed. It's better to have an explicit option defaulting to False, which is something we also want to do in Scanpy at some point.

gokceneraslan commented 3 years ago

I have a question, which I guess is related to SCIRA: I was wondering why the expression of TFs is not taken into account at all. It is known that they are hard to detect, but if they are expressed, that's additional evidence that they are active, isn't it?

PauBadiaM commented 3 years ago

@gokceneraslan thank you for the edits! Completely agree with them :+1: Regarding pSCIRA, the expression of TFs is indeed taken into account: if you explore the regulons, you will see that TFs are also gene targets:

dorothea_hs = dorothea.load_regulons()
len(set(dorothea_hs.columns) & set(dorothea_hs.index))
> 1395

Another way of looking at this:

np.sum(dorothea_hs.loc[dorothea_hs.columns],axis=0)
tf
ADNP       29.0
ADNP2      31.0
AEBP2      62.0
AHR        11.0
AHRR       -2.0
           ... 
ZSCAN5A    13.0
ZSCAN9     61.0
ZXDA       36.0
ZXDB       28.0
ZXDC       46.0
Length: 1395, dtype: float64

The expression of TFs can also affect the activity of other TFs (positively or negatively).

gokceneraslan commented 3 years ago

Sorry, I wasn't clear enough. I was asking why this is zero:

import dorothea

dorothea_hs = dorothea.load_regulons()
cn = sorted(list(set(dorothea_hs.columns) & set(dorothea_hs.index)))
dorothea_hs.loc[cn, cn].values.diagonal().sum()

For example, I have a FOS-high and a FOS-low cluster, but when I run dorothea, I see a high FOS activation signal only in the FOS-low cluster. So I was wondering: if FOS is highly expressed in the first cluster, why doesn't that count as evidence that FOS is active in that cluster too?

gokceneraslan commented 3 years ago

In other words, if this is a TF activity prediction method, why is the expression of the TF of interest completely ignored?

PauBadiaM commented 3 years ago

Good question! The expression of the TF of interest is ignored because TF activities do not correlate with their expression (high expression does not mean high activity). Instead, since we use a footprint methodology, we focus on the targets that are downstream of the TFs. If said TF is truly active, these genes should be more coordinated than the others.
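As a rough illustration of the footprint idea, the sketch below scores a TF by a weighted mean of its targets' expression. This is a simplified stand-in under stated assumptions, not the pSCIRA or dorothea implementation: the gene names other than FOS, the expression values, and the weights are all made up.

```python
import numpy as np

# Hypothetical toy data: expression of 4 genes in 2 cells (rows = cells).
genes = ["FOS", "TARGET1", "TARGET2", "OTHER"]
expr = np.array([
    [5.0, 0.1, 0.2, 1.0],   # cell 0: FOS high, its targets low
    [0.5, 2.0, 3.0, 1.0],   # cell 1: FOS low, its targets high
])

# Assumed regulon weights for FOS: +1 for activated targets, 0 elsewhere.
# The weight of FOS on itself is 0, mirroring the zero diagonal discussed above.
fos_weights = np.array([0.0, 1.0, 1.0, 0.0])

# Footprint-style score: weighted mean of target expression per cell.
# The TF's own expression never enters the score because its weight is 0.
activity = expr @ fos_weights / np.abs(fos_weights).sum()
```

In this toy setup the FOS-low cell gets the higher score, because only the downstream targets contribute. That is the same pattern as the FOS-high versus FOS-low observation raised earlier in the thread.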