saezlab / decoupler-py

Python package to perform enrichment analysis from omics data.
https://decoupler-py.readthedocs.io/
GNU General Public License v3.0

Inferring TF activity using custom GRN #133

Closed AmelZulji closed 5 days ago

AmelZulji commented 2 weeks ago

Hi,

I have a few questions related to using a GRN inferred by Pando on 10x multiome data.

import decoupler as dc
import pandas as pd

sct_norm = pd.read_csv('test/pando_demo/sct_norm.csv')
sct_norm = sct_norm.set_index('barcode')
sct_norm.head()
#>                     AL627309.1  AL627309.5  LINC01409  LINC01128  LINC00115  \
barcode                                                                       
AAACAGCCAAACCTTG-1         0.0         0.0   0.000000        0.0        0.0   
AAACAGCCAACAACAA-1         0.0         0.0   0.000000        0.0        0.0   
AAACAGCCACAATGTT-1         0.0         0.0   0.000000        0.0        0.0   
AAACAGCCATCGTTCT-1         0.0         0.0   0.693147        0.0        0.0   
AAACATGCAACTGGGA-1         0.0         0.0   0.000000        0.0        0.0   

                    FAM41C  SAMD11  NOC2L  KLHL17  AL645608.7  ...    MT-CYB  \
barcode                                                        ...             
AAACAGCCAAACCTTG-1     0.0     0.0    0.0     0.0         0.0  ...  0.000000   
AAACAGCCAACAACAA-1     0.0     0.0    0.0     0.0         0.0  ...  1.098612   
AAACAGCCACAATGTT-1     0.0     0.0    0.0     0.0         0.0  ...  1.386294   
AAACAGCCATCGTTCT-1     0.0     0.0    0.0     0.0         0.0  ...  0.000000   
AAACATGCAACTGGGA-1     0.0     0.0    0.0     0.0         0.0  ...  0.000000   

                    BX004987.1  AC145212.1  MAFIP  AC011043.1  AL354822.1  \
barcode                                                                     
AAACAGCCAAACCTTG-1         0.0         0.0    0.0         0.0         0.0   
AAACAGCCAACAACAA-1         0.0         0.0    0.0         0.0         0.0   
AAACAGCCACAATGTT-1         0.0         0.0    0.0         0.0         0.0   
AAACAGCCATCGTTCT-1         0.0         0.0    0.0         0.0         0.0   
AAACATGCAACTGGGA-1         0.0         0.0    0.0         0.0         0.0   

                    AL592183.1  AC240274.1  AC007325.4  AC007325.2  
barcode                                                             
AAACAGCCAAACCTTG-1         0.0         0.0         0.0         0.0  
AAACAGCCAACAACAA-1         0.0         0.0         0.0         0.0  
AAACAGCCACAATGTT-1         0.0         0.0         0.0         0.0  
AAACAGCCATCGTTCT-1         0.0         0.0         0.0         0.0  
AAACATGCAACTGGGA-1         0.0         0.0         0.0         0.0  

[5 rows x 23779 columns]

modules = pd.read_csv('test/pando_demo/pando_modules.csv')
modules.head()
#>      tf    target  estimate  n_regions  n_genes  n_tfs  \
0  ADNP    ADGRV1  0.532202          8        8     22   
1  ADNP  C19orf25  0.030300          3        8      6   
2  ADNP     CNIH3 -0.162570          1        8      3   
3  ADNP      FNTB -0.044662          3        8      5   
4  ADNP      SIAE  0.190574          5        8      9   

                     regions          pval      padj  
0     chr5-90680787-90681314  6.477065e-08  0.000007  
1      chr19-1523580-1524073  2.429566e-04  0.014453  
2   chr1-224565032-224565364  1.144737e-06  0.000106  
3    chr14-64957010-64957939  1.011267e-03  0.048290  
4  chr11-124725297-124726106  2.131168e-04  0.012900  
test = dc.run_ulm(
    mat=sct_norm,
    net=modules,
    source='tf',
    target='target',
    weight='estimate',
    verbose=True
)
#> 745 features of mat are empty, they will be removed.
#> Running ulm on mat with 4849 samples and 23034 targets for 291 sources.
modules['tf'].unique().shape
#> (660,)

Thank you and kind regards, Amel

PauBadiaM commented 2 weeks ago

Hi @AmelZulji,

Here I assume the estimates are equivalent to the weights in CollecTRI. However, I am not sure whether I should min-max scale them so that the estimates are in the [-1, 1] range?

Indeed, you can use these estimates as weights. However, across the different benchmarks we have run, we have observed that the magnitude of the weights does not provide much information; it is rather their direction (sign) that matters. For this reason, I'd recommend using just the sign of Pando's estimates as the weight. You can do:

import numpy as np

# Keep only the mode of regulation: +1 (activation), -1 (repression)
modules['weight'] = np.sign(modules['estimate'])

Here decoupler runs with 291 TFs, but the model provided by Pando has 660 TFs. What causes that? Is it possible that (all) targets of some TFs are dropped while filtering empty features from mat?

It can be two things. First, many target genes may be empty; these get automatically removed by decoupler. Second, after filtering empty features, many TFs may be left with fewer than 5 target genes, which decoupler then ignores. This is controlled by the parameter min_n=5; you could set it to a lower value, but I would not go lower than 3 (if it is only 2 genes, is it really a gene set?). min_n is important to make sure that the obtained activities are robust and not driven by just 1 or 2 genes.
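You can check the effect of this filter directly: drop empty genes from the expression matrix, then count how many targets each TF retains. Here is a minimal sketch with a toy matrix and network (the data and variable names are made up for illustration; `min_n` mirrors decoupler's parameter):

```python
import pandas as pd

# Toy expression matrix: gene "G3" is empty (all zeros) and would be dropped.
mat = pd.DataFrame(
    {"G1": [1.0, 0.5], "G2": [0.0, 2.0], "G3": [0.0, 0.0]},
    index=["cell1", "cell2"],
)
net = pd.DataFrame({
    "tf": ["TF_A", "TF_A", "TF_A", "TF_B"],
    "target": ["G1", "G2", "G3", "G3"],
})

# 1) Remove empty features, as decoupler does before fitting.
expressed = mat.columns[(mat != 0).any(axis=0)]

# 2) Keep only network edges whose target survived the filter.
net_f = net[net["target"].isin(expressed)]

# 3) Count remaining targets per TF; TFs below min_n are ignored.
min_n = 2
n_targets = net_f.groupby("tf")["target"].nunique()
kept_tfs = n_targets[n_targets >= min_n].index.tolist()
print(kept_tfs)  # TF_B loses its only target and is dropped
```

Running this kind of check on your own matrix and Pando network should account for the gap between 660 and 291 TFs.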

Hope this is helpful! Let me know if you have more questions ;)

AmelZulji commented 2 weeks ago

Thank you for the quick reply, @PauBadiaM!

For this reason, I'd recommend using just the sign of Pando's estimates as the weight.

Will I lose information/interpretability that way, since the estimates are dataset-specific?

In addition, once the scores are computed, would you greenlight using them as another assay/layer to compute differences in TF/regulon activity, let's say between 2 conditions? If yes, do you have suggestions on which way to go? (You mentioned in previous questions that you had to reimplement scanpy's functionality for DEA.)

PauBadiaM commented 2 weeks ago

Will I lose information/interpretability that way, since the estimates are dataset-specific?

Not really, since the most "valuable" dataset-specific information is the TF-gene interaction itself and its mode of regulation (just the sign).

In addition, once the scores are computed, would you greenlight using them as another assay/layer to compute differences in TF/regulon activity, let's say between 2 conditions? If yes, do you have suggestions on which way to go? (You mentioned in previous questions that you had to reimplement scanpy's functionality for DEA.)

You can compute activities at the observation level (for each single cell) and then apply statistics to them using, for example, the decoupler function rank_sources_groups.
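Conceptually, this boils down to testing each source's activity in one group of cells against the rest. A minimal sketch of that idea (not decoupler's actual implementation) with a plain t-test; the activity matrix `acts` and the group labels here are simulated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-cell TF activity scores (cells x TFs) and cell groups.
acts = rng.normal(size=(60, 3))
acts[:30, 0] += 2.0  # TF0 is more active in group "A"
groups = np.array(["A"] * 30 + ["B"] * 30)

# For each TF, test one group vs the rest, as rank_sources_groups does conceptually.
mask = groups == "A"
for j in range(acts.shape[1]):
    t, p = stats.ttest_ind(acts[mask, j], acts[~mask, j])
    print(f"TF{j}: t={t:.2f}, p={p:.3g}")
```

In practice you would let rank_sources_groups handle this, since it also ranks the results and corrects for multiple testing.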

This works well for "marker" extraction (one cell type vs the rest) but not so much if you want to compare conditions, since p-values will be inflated. This is because each cell is treated as a sample. We know that single cells within a sample are not independent of each other, since they were isolated from the same environment. If we treat cells as samples, we are not testing the variation across a population of samples but rather the variation inside an individual one. Moreover, if one sample has more cells than another, it might bias the results.

For this reason, I would recommend generating pseudobulk profiles for each sample and cell type, running differential expression analysis, and inferring TF activities from the obtained contrast statistics. There is a detailed vignette showcasing how to do this here.
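The pseudobulk step itself is just summing counts over cells that share a sample and cell type (decoupler provides get_pseudobulk for this; see the vignette). The core idea can be sketched with a plain groupby, using made-up column names and counts:

```python
import pandas as pd

# Hypothetical per-cell counts with sample and cell-type annotations.
counts = pd.DataFrame({
    "sample":    ["s1", "s1", "s1", "s2", "s2"],
    "cell_type": ["T",  "T",  "B",  "T",  "B"],
    "GeneA":     [1,    2,    0,    4,    1],
    "GeneB":     [0,    1,    3,    2,    0],
})

# Sum counts per (sample, cell_type) to get one pseudobulk profile each;
# DEA is then run on these profiles, not on individual cells.
pb = counts.groupby(["sample", "cell_type"]).sum()
print(pb)
```

Each row of `pb` behaves like one bulk sample, so downstream DEA sees the sample-level variation rather than the cell-level one.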

Let me know how it goes!

AmelZulji commented 5 days ago

Thank you for the detailed clarification, Pau!

It works as expected.

Regards