saezlab / decoupleR

R package to infer biological activities from omics data using a collection of methods.
https://saezlab.github.io/decoupleR/
GNU General Public License v3.0

GSEA_analysis #99

Closed mms100 closed 7 months ago

mms100 commented 9 months ago

Many thanks for the helpful tool.

For GSEA-style analysis, you have mentioned that run_ulm or run_mlm are preferable to run_gsea.

So I would like to know the difference between applying the following:

1. run_ulm and run_mlm, with weights = 1 added to my data frame
2. get_ora_df()
3. run_ora()

to the DEG results that I have from MAST. By the way, MAST does not report a t-statistic as DESeq2 does; its output is p-values, FDR, and logFC.

Many thanks in advance, Mohamed.

PauBadiaM commented 8 months ago

Hi @mms100,

Sorry for the late reply, somehow I forgot to reply to your issue!

  1. You can run ulm or mlm with weights = 1 if your gene sets are weightless; the assumption is then that each gene has the same "importance" for the enrichment score inference. The difference between the two methods is that ulm treats each gene set independently, while mlm accounts for all of them at the same time during fitting. However, if your gene sets are co-linear (meaning they share a lot of targets), it is better to use ulm, since mlm might not even be able to run (a matrix of co-linear covariates cannot be inverted).
  2. get_ora_df is a function available in the Python version of decoupler. It takes as input a dataframe containing genes as index and a contrast gene-level statistic to be used for the enrichment analysis (if using DESeq2, we recommend using the stat column). You can see an example in this vignette. This function also returns extra results such as the overlap ratio and the odds ratio.
  3. In contrast, run_ora computes the Fisher exact test at the observation level (cell or sample). It does this by selecting a top percentage of expressed genes for each observation and then running the test for each gene set.

For MAST, you could use the resulting logFCs as input for enrichment analysis. Hope this is helpful! Let me know if you need anything else.
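To make the ulm idea in point 1 concrete, here is a minimal sketch in plain Python. It is not decoupler's implementation. It assumes the score is the t-value of the slope when the gene-level statistics (here logFCs) are regressed on the gene-set weights, with members weighted 1 and non-members 0. The function name `ulm_score` and the toy values are hypothetical.

```python
import math


def ulm_score(stats, gene_set, weights=None):
    """t-value of the slope when regressing gene statistics on set weights.

    stats: dict mapping gene -> statistic (for example a logFC)
    gene_set: iterable of member genes
    weights: optional dict gene -> weight; members default to 1.0
    """
    genes = sorted(stats)
    members = set(gene_set)
    w = weights or {}
    # Members get their weight (1.0 if weightless), all other genes get 0.
    x = [w.get(g, 1.0) if g in members else 0.0 for g in genes]
    y = [stats[g] for g in genes]
    n = len(genes)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    # For a one-predictor linear model, the slope t-value follows from r.
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy logFCs (made up for illustration only).
logfc = {"A": 2.0, "B": 1.5, "C": 1.8, "D": -0.2,
         "E": 0.1, "F": -1.9, "G": -2.2, "H": 0.3}
up_set = ["A", "B", "C"]
down_set = ["F", "G"]

print(ulm_score(logfc, up_set))    # positive: members have high logFCs
print(ulm_score(logfc, down_set))  # negative: members have low logFCs
```

Each gene set is scored on its own here, which is why shared targets between sets do not prevent the fit, and the sign of the score carries direction.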

mms100 commented 8 months ago

Hi @PauBadiaM

Many thanks for the detailed response.

Regarding your suggestion of using the logFCs from MAST, can I use the values from the following expression as input for the get_ora_df function:

MAST_contrast_df$Ranks = sign(MAST_contrast_df$logFC) * -log10(MAST_contrast_df$PValue)

Lastly, if you think that using these rank values is a good idea with get_ora_df, how do I know whether the function will use the Ranks values rather than the logFCs if both columns are present in the input data frame?

Thanks a lot, Mohamed

PauBadiaM commented 8 months ago

Yes, that would be reasonable. As for how to use the ranks: the function just takes the genes that are in the index, so you need to filter the dataframe manually by whichever metric you prefer (for example, keeping the top 100 genes) and then pass the filtered dataframe to decoupler.get_ora_df.
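The filtering step could look roughly like the sketch below. It uses a plain dict in place of the MAST dataframe so that it runs without extra packages; with pandas the same idea is a sort on the metric followed by taking the first rows. The helper `signed_rank`, the p-value floor, and the toy values are assumptions for illustration, and the final call to decoupler.get_ora_df is left out.

```python
import math


def signed_rank(logfc, pvalue, floor=1e-300):
    """sign(logFC) * -log10(p), with p floored to avoid log10(0)."""
    p = max(pvalue, floor)
    sign = (logfc > 0) - (logfc < 0)
    return sign * -math.log10(p)


# Toy MAST-like output: gene -> (logFC, p-value). Values are made up.
mast = {
    "GeneA": (1.2, 1e-8),
    "GeneB": (-0.8, 1e-5),
    "GeneC": (0.1, 0.5),
    "GeneD": (-2.0, 1e-12),
}

scores = {gene: signed_rank(lfc, p) for gene, (lfc, p) in mast.items()}

# Keep the top N genes by absolute score (N = 2 here, 100 in the thread).
top_n = 2
top_genes = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:top_n]

# Only these genes would be kept as the index of the dataframe that is
# passed on for over-representation analysis.
print(top_genes)
```

Because only the index matters for ORA, it makes no difference which other columns remain in the filtered dataframe.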

mms100 commented 8 months ago

Okay, thanks a lot for your appreciated input. Mohamed

mms100 commented 8 months ago

This should be my last question =D:

Concerning the results of decoupler.get_ora_df, why are all the combined scores positive when I have provided all the DEGs, which have both negative and positive logFCs? Or should I split the DEGs into two dataframes, one with positive logFCs and one with negative logFCs, and run get_ora_df twice to report the upregulated and downregulated pathways separately?

And just to clarify, do the overlap ratio and the odds ratio refer to the number of my genes that are included in the gene set that I have provided in the net?

Many thanks again, Mohamed.

PauBadiaM commented 8 months ago

You can ask as many questions as you want ;) The thing about ORA is that it only provides an undirected enrichment score. You can combine + and - genes and see the overall enrichment, or you can split your genes by sign and run the enrichment separately. Another alternative is to use methods that provide both direction and significance, such as ulm. The overlap ratio is how many genes in your df are present in a given gene set. The odds ratio compares the odds of a gene belonging to the gene set when it is in your list of genes with the odds when it is not; you can read more about it here.
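A small sketch of the split-by-sign approach, in plain Python rather than decoupler itself. It assumes a one-sided Fisher exact test on a 2x2 table, an overlap ratio defined as overlap divided by gene-set size, an odds ratio with 0.5 added to every cell to avoid division by zero, and a background of 20000 genes. These definitions, the function names, and the toy data are assumptions and may differ from what get_ora_df reports.

```python
import math


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value, P(overlap >= a), hypergeometric null."""
    n = a + b + c + d
    row1, col1 = a + b, a + c  # query size, gene-set size
    upper = min(row1, col1)
    tail = sum(math.comb(row1, k) * math.comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / math.comb(n, col1)


def ora(query, gene_set, n_background=20000):
    """Over-representation of gene_set among the query genes."""
    query, gene_set = set(query), set(gene_set)
    a = len(query & gene_set)     # in query and in set
    b = len(query - gene_set)     # in query, not in set
    c = len(gene_set - query)     # in set, not in query
    d = n_background - a - b - c  # in neither
    return {
        "overlap_ratio": a / len(gene_set),
        "odds_ratio": ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)),
        "p_value": fisher_greater(a, b, c, d),
    }


# Toy DEGs and one toy gene set (made up for illustration).
logfc = {"A": 2.0, "B": 1.5, "C": 1.8, "D": -1.2,
         "E": -2.1, "F": -1.7, "G": 0.9}
pathway = {"A", "B", "C", "X"}

up = [g for g, v in logfc.items() if v > 0]
down = [g for g, v in logfc.items() if v < 0]

res_up = ora(up, pathway)
res_down = ora(down, pathway)
print(res_up)
print(res_down)
```

None of the three quantities can be negative, so direction only appears once the genes are split by sign: in this toy case the pathway is over-represented among the upregulated genes and absent from the downregulated ones.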

mms100 commented 8 months ago

Many thanks, I will try to apply this.