Closed ptrairatphisan closed 5 years ago
Hi Panuwat,
I plan to split it up a bit into separate functions. My old script isn't quite function-worthy.. Before, I wanted to check if you agree for a few things:
The geneset gmt file is read in by pasting together the name. I would propose to use filepath instead and optionally rename the files to sth simpler. Otherwise, the function is difficult to apply to new files.
The universe file is currently accessed through the column names, and if the uniprot column is missing the general mapping file is used. I would propose to change this, and include the indices for gene symbol and uniprot in the function. This way the column names are not a problem and the general mapping file can be used although a uniprot column is present if the user wants this.
In the script, I filtered out diseases by regex. This worked for me but is not the best way to do it. I would instead define a new gmt file or provide a list of pathways to include/exclude. The list would have the advantage that you could check if everything is still correctly detected for new gene sets.
Thanks for your comments Anika. Here are my thoughts:
I agree with using the full file path for each gene set as the current assignment would indeed be not possible to use other gene sets from another branch such as GO (C5) or hallmark (H). Yet, I believe we could still keep the original filenames to be consistent with the ones from MSigDB as 1) it would be easier to replace the file if there will be any update and 2) the end user wouldn't see these filenames anyway as we'll provide easier terms to call for the gene sets e.g. datasource = 'kegg', 'biocarta' or 'GO' (in the future).
I believe that there are different ways to assign and supply the universe gene set for the enrichment pipeline and your proposal is indeed more flexible. We could implement an option to select the column of the gene symbol and uniprot e.g. enrichCARNIVAL(genesymbol_col = 1,uniprot_col = 2) as you previously assigned for the genesymbol2uniprot function. Nevertheless, we should also consider the case where the users use gene symbols as the unique identifier for the whole pipeline including PKN. We have to ensure that there is no ambiguity regarding the uniprot part which the users might not be aware of in such case.
The filtering with regular expression is indeed a bit crude but still functioning well. I detected very few non-relevant entries for canonical pathways after the filtering. We could of course dig deeper and refine the inclusion/exclusion criteria for the gene sets. However, this might be quite demanding for general users to select the pathways they want one-by-one. An alternative way I'm thinking now is to let the users formulate their own regular terms for inclusion/exclusion and we turn them into regular expression for filtering. Also, we should keep at least one default filtering regular expression (maybe the current one) to generate the most quick but meaningful results for further interpretation.
Lastly, as you mentioned that you plan to split the pipeline into different functions, could you please confirm if you plan to work on this by yourself or you prefer me to take care of it? Just let me know ok?
@anikaliu I detected a few bugs in the enrichCARNIVAL function and already corrected it plus tested with CARNIVAL Example 3 - see commits.
Could you have a final look and run a small final test then merge the script to the master branch? Thanks!
@ptrairatphisan Thanks, will do!
To facilitate the interpretation of results from CARNIVAL, we also offer an easy to use enrichment analysis pipeline based on the curated gene sets from MSigDB (C2 branch) as performed in the benchmarking and IgAN study (only over-presentation analysis of CARNIVAL nodes).
Here two new CARNIVAL functions are introduced: 1) mapUniprotPKN.R to generate the list of nodes in the prior knowledge network to be used as the 'universe' for enrichment/over-representation analyses. 2) enrichCARNIVAL.R to run enrichment/over-representation analysis where the users can also sub-select the gene set from the C2 branch (kegg, biocarta, reactome)
A tutorial using CARNIVAL Example 3 (APAP, TG-GATEs, Omnipath network) was used to demonstrate the enrichment pipeline. The results are shown below for instance.