Closed gokceneraslan closed 3 years ago
I have a question which I guess is related to SCIRA: I was wondering why the expression of TFs is not taken into account at all. It is known that they are hard to detect, but if they are expressed, that is additional evidence that they are active, isn't it?
@gokceneraslan thank you for the edits! Completely agree with them :+1: Regarding pSCIRA, the expression of TFs is indeed taken into account: if you explore the regulons, you will see that TFs are also gene targets:
import dorothea
dorothea_hs = dorothea.load_regulons()
# Count genes that appear both as TFs (columns) and as targets (rows)
len(set(dorothea_hs.columns) & set(dorothea_hs.index))
> 1395
Another way of looking at this:
import numpy as np
np.sum(dorothea_hs.loc[dorothea_hs.columns], axis=0)
tf
ADNP 29.0
ADNP2 31.0
AEBP2 62.0
AHR 11.0
AHRR -2.0
...
ZSCAN5A 13.0
ZSCAN9 61.0
ZXDA 36.0
ZXDB 28.0
ZXDC 46.0
Length: 1395, dtype: float64
The expression of TFs can also affect the activity of other TFs (positively or negatively).
Sorry, I wasn't clear enough. I was asking why this is zero:
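import dorothea
dorothea_hs = dorothea.load_regulons()
# Genes present both as TFs (columns) and as targets (rows)
cn = sorted(set(dorothea_hs.columns) & set(dorothea_hs.index))
# Sum of each TF's weight on itself (the diagonal of the TF-by-TF block)
dorothea_hs.loc[cn, cn].values.diagonal().sum()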
For example, I have FOS-high and FOS-low clusters, but when I run dorothea, I see a high FOS activation signal only in the FOS-low cluster. So I was wondering: if FOS is highly expressed in the first cluster, why doesn't that count as evidence that FOS is active in that cluster too?
In other words, if this is a TF activity prediction method, why is the expression of the TF of interest completely ignored?
Good question! The expression of the TF of interest is ignored because TF activities do not correlate with their expression (high expression does not mean high activity). Instead, since we use a footprint methodology, we focus on the targets that are downstream of the TFs. If said TF is truly active, these genes should be more coordinated than the others.
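To make the footprint idea concrete, here is a minimal sketch on toy data. It is not the pSCIRA or dorothea implementation: the gene names, the weights, the expression values, and the simple weighted-mean score in the hypothetical `footprint_score` helper are all assumptions made for illustration. The point is only that a TF with a zero weight on itself gets its score entirely from its targets.

```python
import pandas as pd

# Toy regulon (made-up weights): rows are target genes, columns are TFs.
# FOS has weight 0 on itself, mirroring the zero diagonal discussed above.
regnet = pd.DataFrame(
    {"FOS": [0.0, 1.0, 1.0, -1.0]},
    index=["FOS", "GENE_A", "GENE_B", "GENE_C"],
)

# Toy scaled expression: rows are cells, columns are genes.
expr = pd.DataFrame(
    [[2.0, -1.0, -1.0, 1.0],    # FOS mRNA high, targets move against the regulon
     [-2.0, 1.0, 1.0, -1.0]],   # FOS mRNA low, targets follow the regulon
    index=["cell_hi", "cell_lo"],
    columns=["FOS", "GENE_A", "GENE_B", "GENE_C"],
)

def footprint_score(expr, regnet):
    """Hypothetical score: weighted mean of target expression per TF."""
    regnet = regnet.loc[expr.columns]       # align regulon rows to expression genes
    n_targets = (regnet != 0).sum(axis=0)   # number of non-zero targets per TF
    return expr.dot(regnet) / n_targets

activity = footprint_score(expr, regnet)
print(activity)
```

In this toy case the cell with high FOS expression gets the lower FOS score, because only GENE_A, GENE_B and GENE_C contribute to it.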
A few sanity checks and minor improvements. I was particularly scared by this line:
r_genes, r_tfs, R = np.sort(regnet.index), regnet.columns, np.array(regnet)
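since the target gene names and the regnet rows can easily go out of sync: np.sort reorders the labels, while np.array(regnet) keeps the rows in their original order (e.g. if the regnet rows are not alphabetically sorted for some reason).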
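A small toy example of how that mismatch can arise, and one possible way to avoid it. The data frame below is made up, and sorting the frame with `sort_index()` is just one option, not necessarily the change made in this PR.

```python
import numpy as np
import pandas as pd

# Toy regulon whose rows are deliberately NOT in alphabetical order.
regnet = pd.DataFrame(
    {"TF1": [1.0, 0.0, -1.0]},
    index=["GENE_C", "GENE_A", "GENE_B"],
)

# Pattern from the line above: the labels are sorted, the matrix rows are not.
r_genes, R = np.sort(regnet.index), np.array(regnet)
mismatched = dict(zip(r_genes, R[:, 0]))

# One possible fix: sort the frame first so labels and rows move together.
regnet_sorted = regnet.sort_index()
r_genes_ok = regnet_sorted.index.to_numpy()
R_ok = regnet_sorted.to_numpy()
aligned = dict(zip(r_genes_ok, R_ok[:, 0]))

print("true weight of GENE_A:", regnet.loc["GENE_A", "TF1"])
print("mismatched lookup:    ", mismatched["GENE_A"])
print("aligned lookup:       ", aligned["GENE_A"])
```

With the unsorted input, the label GENE_A ends up paired with the weight that belongs to GENE_C, and nothing raises an error.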
I also turned off use_raw by default, since most of the time using raw implicitly is misleading and problems might go unnoticed. It's better to have an explicit option defaulting to False, which is something we also want to do in Scanpy at some point.
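As a rough sketch of what an explicit, opt-in use_raw can look like: the helper name `get_expression` is hypothetical and not part of this package, and a `SimpleNamespace` stands in for an AnnData object so the snippet runs on its own (it only relies on the `.X` and `.raw.X` attributes).

```python
from types import SimpleNamespace

import numpy as np

def get_expression(adata, use_raw=False):
    """Return the matrix to score; .raw is only used when explicitly requested."""
    if use_raw:
        if adata.raw is None:
            raise ValueError("use_raw=True was requested but adata.raw is not set")
        return adata.raw.X
    return adata.X

# Stand-in for an AnnData object with both a processed and a raw matrix.
toy = SimpleNamespace(
    X=np.array([[0.5, -0.5]]),
    raw=SimpleNamespace(X=np.array([[10.0, 3.0]])),
)

default_matrix = get_expression(toy)             # processed values
raw_matrix = get_expression(toy, use_raw=True)   # raw values, on request only
```

With this shape, a caller who never mentions use_raw always gets the processed matrix, and asking for raw data that does not exist fails loudly instead of silently falling back.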