GSEA handling missing genes

tboen1 commented 3 months ago

In the decoupler implementation of gsea, how does decoupler handle genes present in the geneset, but missing in the ranked gene list? In particular, I'm working with a single cell dataset that contains only the top 2000 highly variable genes. Will this cause inaccurate pvalues when using decoupler.run_gsea, if the provided genesets contain genes not included in my highly variable genes? Will decoupler.run_gsea provide its own normalization, or is it recommended to remove genes in genesets that are not present in the ranked gene list?

PauBadiaM commented 3 months ago

Hi @tboen1,

Whenever a gene is present in the gene set, but missing in your ranked list, it is automatically removed. In decoupler we always assume that the input mat (your ranked gene list) contains your gene universe, and then we intersect this to your provided net (collection of gene sets).

If you are using single-cell data, I'd recommend to rather use run_ulm for its better computational scalability and its better performance compared to gsea (check decoupler's manuscript). Its okay to do feature selection, but with ulm you'll be able to keep all genes and still run enrichment scores in seconds.

Most methods implemented in decoupler expect inputs that follow a normal distribution (for gsea it should not be a problem), so I would still recommend to normalize in case you want to use other methods such as ulm.

Hope this is helpful! Let me know if you have further questions.

tboen1 commented 3 months ago

Thanks for the insight! As a follow up question, are there any recommendations for running decoupler on multiple single cell data with potential batch effects?

For example, the scRNAseq demo here: https://decoupler-py.readthedocs.io/en/latest/notebooks/progeny.html shows how to find pathway activity information for each individual cell, but presumably all the cells are coming from the same dataset.

If we have multiple scRNAseq datasets, do any methods in decoupler natively handle batch effect correction when performing activity inference? Or will we need to perform correction/integration before or after applying decoupler?

PauBadiaM commented 3 months ago

Hi @tboen1,

Usually in single-cell data integration is only done at the PCA level and never at the gene expression level itself, meaning that even if you perform integration the stored gene expression remains the same so activities will not change.

If you expect huge batch effects and your objective is to compare activities between conditions (for example healthy vs disease) I would recommend to not compute activities at the single-cell level but rather at the contrast level after performing pseudobulk. In this vignette, we show how to perform pseudobulking for each cell type and sample, how to run DEA using pyDESeq2 and how to compute activities from the obtained contrasts. In the DEA step, you can add the batch covariate to the pyDESeq2 model so that the obtained contrast statistics are corrected by it, this way the computed activities will also be corrected. Hope this is helpful!

tboen1 commented 3 months ago

Thanks for the detailed explanation! The vignette is super helpful, but I have a few follow up questions

In the vignette, the "score" assigned to each gene is the t-statistic and is consistent across all further functional analysis, i.e. mat = results_df[['stat']].T.rename(index={'stat': 'T cell'}) is defined once.

However, transcription factor inference with CollecTRI is performed with a univariate linear model (ULM), PROGENy with a multivariate linear model (MLM), and functional enrichment with overrepresentation analysis (ORA). These pairings appear to be consistent across the other vignettes.

In particular for PROGENy and CollecTRI, are these illustrative examples or are ULM and MLM recommended for these analysis, respectively? I experimented with GSEA and ORA on these networks and the results were inconsistent with the results from ULM and MLM.

Also, I notice that in the figure describing MLM, the colorbar for y_i appears to assume positive gene expression values. In the pseuodobulk vignette, MLM is called using, pathway_acts, pathway_pvals = dc.run_mlm(mat=mat, net=progeny, verbose=True) where mat contains the t-statistic scores described above. Since the t-statistic scores can be positive or negative, will these violate the assumptions for MLM or is it still a viable method for PROGENy using pseudobulk contrasts?

Apologies for the wall of text, and @PauBadiaM thanks so much for your very helpful responses, I really appreciate it.

PauBadiaM commented 3 months ago

Hi @tboen1,

However, transcription factor inference with CollecTRI is performed with a univariate linear model (ULM), PROGENy with a multivariate linear model (MLM), and functional enrichment with overrepresentation analysis (ORA). These pairings appear to be consistent across the other vignettes.

Across the vignettes I use different methods for different resources just to show how decoupler can use any of these interchangeably. Even though in the manuscript we showed that simple linear approaches may outperform other classical enrichment methods people still like to see the method they are more familiar with being used. To be honest, my recommendation would be to always use ulm, for its simplicity, stability, interpretability and scalability.

In particular for PROGENy and CollecTRI, are these illustrative examples or are ULM and MLM recommended for these analysis, respectively?

You can use any method for any resource, they are not coupled (hence the name "decoupler" hehe). However it is true that for example I would not use mlm for resources that have high overlap between sources (such as in the case of a gene regulatory network like CollecTRI) since you might run into collinearity issues. Again, my recommendation would be to just use ulm everywhere.

I experimented with GSEA and ORA on these networks and the results were inconsistent with the results from ULM and MLM.

Indeed this is something we show in the original manuscript. These changes are expected since ora can just return positive scores but there is no way to distinguish between positive and negative enrichment (unless you split your genes by sign but then you lose power). For gsea it is also expected to be different since the network weights are not considered in this method (notice how there is no weight parameter) and even though it can return positive or negative scores these might not align with what is expected (for example, a repressor TF with all its edges being negative might get a high positive score when it should get a very negative score if weights where to be considered).

Since the t-statistic scores can be positive or negative, will these violate the assumptions for MLM or is it still a viable method for PROGENy using pseudobulk contrasts?

All methods in decoupler can work with positive and negative values so it is a valid method. The only requirement for most of these methods is that the observed values of gene expression follow a normal distribution (for example DEA results or normalized expression), only gsea and gsva can also work with unormalized data since they work with ranks.

Apologies for the wall of text, and @PauBadiaM thanks so much for your very helpful responses, I really appreciate it.

No problem at all! You ask good questions that also might be helpful for other users. Let me know if you have any further questions ;)

saezlab / decoupler-py

GSEA handling missing genes #132