gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Interpretation of Gene Names and Downstream Analysis in Panaroo Pan-genome Studies #262

Closed: luobosi closed this issue 8 months ago

luobosi commented 10 months ago

Dear Panaroo software development team and community,

I am reaching out with some questions regarding the use of Panaroo that have arisen during my research. Your insights or thoughts on these matters would be greatly appreciated.

Context of Application:

I am conducting a pan-genome analysis on a relatively large group of 15 different species. Using Panaroo, I have generated a pan-genome for each individual species, plus a combined pan-genome that takes the GFF files from all 15 species as input.

My objective is to perform functional enrichment analysis using the enricher function from the R package clusterProfiler, which requires two gene lists: an input list and a background set. To compare the enrichment results for different gene types (core, shell, cloud) across species, I have chosen the non-redundant gene set from the combined analysis of all 15 species as the background set, and the gene list of a specific type from an individual species as the input list (e.g., the core gene list of species A).
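For readers less familiar with this kind of analysis: over-representation tools such as enricher are based on a hypergeometric test. The following is a minimal Python sketch of that test for a single GO term, not clusterProfiler's actual implementation. The function name and the decision to drop input genes that are missing from the background are assumptions made for illustration. It shows why the input list and the background set must use the same gene names.

```python
from math import comb


def enrichment_p_value(input_genes, background_genes, term_genes):
    """One-sided hypergeometric p-value for one GO term (illustrative sketch).

    All three arguments are iterables of gene names. Names are compared as
    plain strings, so "pucF" and "pucF~~~pucF_2" count as different genes.
    """
    background = set(background_genes)
    # Assumption of this sketch: input genes whose names are not found in
    # the background are silently dropped.
    drawn = set(input_genes) & background
    annotated = set(term_genes) & background

    N = len(background)          # size of the background set
    K = len(annotated)           # background genes carrying the term
    n = len(drawn)               # input genes found in the background
    k = len(drawn & annotated)   # input genes carrying the term

    if n == 0:
        return 1.0

    # P(X >= k) when drawing n genes without replacement from N.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)
```

If a core gene is named mreC in the single-species run but mreC~~~mreC_1~~~mreC_2 in the combined background, it falls out of `drawn` in this sketch, which changes both n and k and therefore the p-value.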

Specific Issue:

During this process, I have noticed that the gene names in the pan_genome_reference.fa files from the individual-species analyses and from the combined analysis are not entirely consistent. Here are some examples from the GO enrichment input:

# species A, core
mreC    GO:0003674,GO:0005488,GO:0005515,...
# combined name
mreC~~~mreC_1~~~mreC_2  GO:0003674,GO:0005488,GO:0005515,...

# species B, cloud
rplS    GO:0003674,GO:0003735,GO:0005198,...
# combined name
rplS~~~rplS_1~~~rplS_2  GO:0003674,GO:0003735,GO:0005198,...

# species C vs. combined name
pucF    pucF~~~pucF_2~~~pucF_1
pucF    pucF_2
pucF    pucF~~~pucF_2~~~pucF_1~~~pucF_3

When gene clusters are merged, the gene names in the combined file change relative to the names used within each individual species. Enrichment analysis typically requires consistent gene naming, and the string pucF is clearly not equal to pucF~~~pucF_2~~~pucF_1: the first is only contained in the second.

This inconsistency has shaken my confidence in the downstream enrichment analysis based on the pan-genome results. Frankly, whether I rename the genes, split pucF~~~pucF_2~~~pucF_1 into three separate entries on the "~~~" delimiter, or de-duplicate to keep only a single line, each of these approaches seems to alter the distribution of GO terms, which is both disheartening and confusing.
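To make the options above concrete, here is a minimal Python sketch of one of them: reducing each combined name to its set of base gene names, so that pucF~~~pucF_2~~~pucF_1 becomes a single pucF entry instead of three. The function names are hypothetical, and two assumptions are baked in: a trailing "_<number>" is treated as a paralog suffix, and names starting with "group_" are treated as placeholder names for unannotated clusters and left untouched. Matching on names alone is a heuristic; a mapping through the per-sample gene identifiers would probably be more robust.

```python
import re
from collections import defaultdict

SEPARATOR = "~~~"                  # delimiter seen in the combined names
_SUFFIX = re.compile(r"_\d+$")     # assumed paralog suffix, e.g. "_2"


def base_names(cluster_name):
    """Return the sorted, de-duplicated base names in one cluster name."""
    names = set()
    for part in cluster_name.split(SEPARATOR):
        # Assumption: "group_<number>" is a placeholder name, keep it whole.
        if part.startswith("group_"):
            names.add(part)
        else:
            names.add(_SUFFIX.sub("", part))
    return sorted(names)


def index_by_base_name(combined_names):
    """Map each base name to every combined cluster name that contains it."""
    index = defaultdict(list)
    for name in combined_names:
        for base in base_names(name):
            index[base].append(name)
    return dict(index)
```

A base name that maps to several combined clusters, as pucF does in the example above, is a case where renaming cannot be done automatically and the clusters need to be inspected by hand.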

I am keen to learn how others in the community address this issue and how one might approach the resolution of gene naming and enrichment analysis challenges.

Panaroo Command:

panaroo -i $input_dir/*.gff -o $output_dir/ --clean-mode strict -a core --aligner mafft --core_threshold 0.99 -t $cpus --merge_paralogs
Version: Panaroo 1.3.4

I look forward to your valuable feedback and thank you in advance for your assistance.

Best regards, luobosi

gtonkinhill commented 10 months ago

Hi,

Thanks for reaching out. Panaroo was designed mainly to work with a single species, and although we have had some success combining different species, this has not been tested extensively.

You could look at merging the individual species runs using the merge command (panaroo-merge) outlined in the Panaroo documentation.

In this case I would recommend enabling the --merge_paralogs option and installing the latest development version, as we have included some important updates recently. This should be released soon but can be installed now using:

pip install git+https://github.com/gtonkinhill/panaroo@devel

This should give you consistent gene names, but you would need to check that the resulting pangenome graph and gene clusters look reasonable. You could use Cytoscape for this.

In terms of calling 'enriched' genes, this is quite a tricky problem. It is important to be wary of the impact of population structure on your results.