Closed mms100 closed 7 months ago
Hi @mms100,
Sorry for the late reply, somehow I forgot to reply to your issue!

You can use `ulm` or `mlm` with weights = 1 in case your gene sets are weightless; the assumption here is that each gene has the same "importance" for the enrichment score inference. The difference between the two methods is that `ulm` treats each gene set independently, while `mlm` accounts for all of them at the same time during fitting. However, if your gene sets are co-linear (meaning they share a lot of targets), it is better to use `ulm`, since `mlm` might not even be able to run (a matrix of co-linear covariates cannot be inverted).

`get_ora_df` is a function available in the Python version of `decoupler`. It takes as input a dataframe containing genes as index and a contrast gene-level statistic to be used for the enrichment analysis (if using DESeq2 we recommend using the `stat` column). You can see an example in this vignette. This function also returns extra results such as the overlap ratio and the odds ratio.

`run_ora` computes the Fisher exact test at the observation level (cell or sample). It does this by selecting the top % of expressed genes for each observation and then running the test for each gene set.

For MAST you could use the resulting logFCs as input for enrichment analysis. Hope this is helpful! Let me know if you need anything else.
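To make the `ulm` versus `mlm` distinction concrete, here is a minimal NumPy sketch. It is not decoupler's implementation: the data are made up, the helper `fit` is hypothetical, and it reports plain regression slopes for simplicity rather than the statistics the real methods return. It only illustrates fitting each gene set on its own versus all together, and what goes wrong with co-linear gene sets.

```python
import numpy as np

# Hypothetical toy data: one gene-level statistic for each of 6 genes.
stat = np.array([2.0, 1.5, 1.0, -0.5, -1.0, 0.2])

# Membership matrix (genes x gene sets); every member gene gets weight 1.
net = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
], dtype=float)


def fit(y, X):
    """Least-squares fit of y on the columns of X plus an intercept; returns the slopes."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


# Univariate approach: one separate model per gene set.
uni = np.array([fit(stat, net[:, [j]])[0] for j in range(net.shape[1])])

# Multivariate approach: all gene sets enter a single model together.
multi = fit(stat, net)

# Two identical (perfectly co-linear) gene sets make the design matrix rank
# deficient, so the multivariate coefficients are no longer uniquely defined.
dup_design = np.column_stack([np.ones(len(stat)), net[:, 0], net[:, 0]])
rank = np.linalg.matrix_rank(dup_design)
is_rank_deficient = rank < dup_design.shape[1]
```

In this toy example the first two gene sets overlap, so the univariate and multivariate slopes for the first set differ: the multivariate fit attributes part of the signal to the second set.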
Hi @PauBadiaM
Many thanks for the detailed response.
Regarding your suggestion of using the logFCs from MAST, can I use the values coming from the following equation as input for the get_ora_df function:
MAST_contrast_df$Ranks = sign(MAST_contrast_df$logFC) * -log10(MAST_contrast_df$PValue)
Lastly, if you think that using the Ranks values is a good idea when applying the get_ora_df function, how do I know whether the function will use the Ranks values and not the logFCs if both columns are present in the input dataframe?
Thanks a lot, Mohamed
Yes, that would be reasonable. Regarding how to use the ranks: the function just takes the genes that are in the index, so you need to manually filter the dataframe by whichever metric you prefer (for example, the top 100 genes) and then pass that dataframe to the function `decoupler.get_ora_df`.
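As a rough illustration of that manual filtering step, the pandas sketch below computes the signed score from the question and keeps the top genes by absolute value. The gene names, values, and the helper `top_genes` are hypothetical; the column names `logFC` and `PValue` follow the R snippet above. The filtered dataframe, with genes in its index, is what would then be handed to `decoupler.get_ora_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical MAST-style results; gene names and values are made up.
mast = pd.DataFrame(
    {
        "logFC": [1.8, -2.1, 0.3, -0.2, 1.1],
        "PValue": [1e-6, 1e-8, 0.4, 0.7, 1e-3],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"],
)

# Signed significance score: sign(logFC) * -log10(p-value).
mast["Ranks"] = np.sign(mast["logFC"]) * -np.log10(mast["PValue"])


def top_genes(df, column, n):
    """Keep the n rows with the largest absolute value in `column`."""
    order = df[column].abs().sort_values(ascending=False).index
    return df.loc[order[:n]]


# Top 3 here; something like the top 100 would be used on a real result table.
top = top_genes(mast, "Ranks", n=3)
```

Because only the index is used downstream, it does not matter which other columns remain in the dataframe after filtering. Note that a p-value of exactly 0 would give an infinite score, so such values may need handling first.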
Okay, thanks a lot for your appreciated input. Mohamed
This should be my last question =D:
Concerning the results of decoupler.get_ora_df, why are all the combined scores positive when I have provided all the DEGs, which have both negative and positive logFCs? Or should I split the DEGs into two dataframes (one with positive logFCs, one with negative logFCs) and run get_ora_df twice to report the upregulated and downregulated pathways separately?
And just to clarify, do the overlap ratio and odds ratio mean the number of genes that are included in the gene set that I have provided in the net?
Many thanks again, Mohamed.
You can ask as many questions as you want ;) The thing about `ora` is that it only provides an undirected enrichment score. You can combine + and - genes and see the overall enrichment, or you can split your genes by sign and run the enrichment separately. Another alternative is to use methods that do provide direction and significance, such as `ulm`.
The overlap ratio is how many genes in your `df` are present in a given gene set. The odds ratio is the relative chance of the gene set being present given the list of genes; you can read more about it here.
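For intuition, over-representation analysis on a gene list can be sketched in plain Python as below. This is not decoupler's code: the function names, the toy gene lists, and the exact definitions used (overlap ratio as overlap divided by gene set size, an uncorrected odds ratio from the 2x2 table) are assumptions for illustration. It shows why the result carries no sign and how running up- and downregulated genes separately recovers direction.

```python
from math import comb


def fisher_right_tail(k, set_size, n_query, n_background):
    """P(overlap >= k) under the hypergeometric null (one-sided Fisher exact test)."""
    total = comb(n_background, n_query)
    upper = min(set_size, n_query)
    hits = sum(
        comb(set_size, i) * comb(n_background - set_size, n_query - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def ora(query, gene_set, n_background):
    """Over-representation of `gene_set` among the `query` genes."""
    query, gene_set = set(query), set(gene_set)
    a = len(query & gene_set)      # in query and in gene set
    b = len(query) - a             # in query only
    c = len(gene_set) - a          # in gene set only
    d = n_background - a - b - c   # in neither
    odds = (a * d) / (b * c) if b * c > 0 else float("inf")
    return {
        "overlap_ratio": a / len(gene_set),  # assumed definition
        "odds_ratio": odds,                  # plain 2x2 odds ratio, no correction
        "pvalue": fisher_right_tail(a, len(gene_set), len(query), n_background),
    }


# Hypothetical example: 20 background genes and one gene set of 4 genes.
gene_set = ["g1", "g2", "g3", "g4"]
up = ["g1", "g2", "g3", "g9"]    # genes with positive logFC
down = ["g10", "g11", "g12"]     # genes with negative logFC

up_res = ora(up, gene_set, n_background=20)
down_res = ora(down, gene_set, n_background=20)
```

Every quantity here is non-negative regardless of the sign of the input genes, so direction only comes from which list (up or down) was tested.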
Many thanks, I will try to apply this.
Many thanks for the helpful tool.
For applying GSEA, you have mentioned that run_ulm or run_mlm are preferable to run_gsea.
So I would like to know the difference between applying the following:
1- run_ulm and run_mlm, adding weights = 1 in my dataframe
2- get_ora_df()
3- run_ora()
on the DEG results that I have from MAST. BTW, it doesn't have a t-stat like DESeq2; the output of MAST is p-values, FDR, and logFC.
Many thanks in advance, Mohamed. If