Closed luobosi closed 8 months ago
Hi,
Thanks for reaching out. Panaroo was designed mainly to work with a single species, and although we have had some success combining different species, this has not been tested extensively.
You could look at merging the individual species runs using the merge command outlined here.
In this case I would recommend enabling the --merge_paralogs option and installing the latest development version, as we have included some important updates recently. This should be released soon but can be installed now using
pip install git+https://github.com/gtonkinhill/panaroo@devel
This should give you consistent gene names, but you would need to check that the resulting pangenome graph and gene clusters look reasonable. You could use Cytoscape for this.
In terms of calling 'enriched' genes, this is quite a tricky problem. It is important to be wary of the impact of population structure on your results.
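Whichever route you take, it may help to check how the names in your input list line up with the background set before running the enrichment. The following is only a minimal sketch of one way to do that, not part of Panaroo: it splits merged cluster names such as pucF~~~pucF_2~~~pucF_1 on the "~~~" delimiter and reports input names that have no counterpart in the background. The function names are hypothetical, and the optional suffix stripping assumes that a trailing _<number> only marks a gene copy, which you would need to confirm for your annotation.

```python
import re

SEPARATOR = "~~~"  # delimiter seen in merged cluster names, e.g. pucF~~~pucF_2~~~pucF_1


def expand_cluster_name(name, strip_suffix=False):
    """Split a merged cluster name into its unique component gene names."""
    parts = [p for p in name.split(SEPARATOR) if p]
    if strip_suffix:
        # Assumption: a trailing _<digits> is a copy suffix (pucF_2 -> pucF).
        # Names starting with "group_" are left alone, on the assumption that
        # they are unannotated cluster IDs where the number is the identifier.
        parts = [p if p.startswith("group_") else re.sub(r"_\d+$", "", p)
                 for p in parts]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(parts))


def missing_from_background(input_names, background_names, strip_suffix=False):
    """Return input names with no component present in the background set."""
    background = set()
    for name in background_names:
        background.update(expand_cluster_name(name, strip_suffix))
    missing = []
    for name in input_names:
        components = expand_cluster_name(name, strip_suffix)
        if not any(c in background for c in components):
            missing.append(name)
    return missing


if __name__ == "__main__":
    combined = ["pucF~~~pucF_2~~~pucF_1", "group_1234"]
    species_a_core = ["pucF", "abcD"]
    print(missing_from_background(species_a_core, combined))
```

Names reported as missing would silently drop out of the enrichment, so an empty result is a useful sanity check. Note that splitting or collapsing names changes how many entries carry each GO term, so the same rule should be applied to both the input list and the background set.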
Dear Panaroo software development team and community,
I am reaching out with some questions regarding the use of Panaroo that have arisen during my research. Your insights or thoughts on these matters would be greatly appreciated.
Context of Application:
I am conducting a pan-genome analysis on a relatively large group of species, comprising 15 different species. Using Panaroo, I have generated pan-genome assemblies for individual species and a combined analysis with GFF files from all 15 species as input.
My objective is to perform functional enrichment analysis using the enricher function from the R package clusterProfiler, which requires two gene lists: an input list and a background set. To compare the enrichment results for different gene types (core, shell, cloud) across species, I have chosen the non-redundant gene set from the combined analysis of all 15 species as the background set, and the gene list of a specific type from an individual species as the input list (e.g., the core gene list of species A).
Specific Issue:
During this process, I have noticed that the gene names in the pan_genome_reference.fa file from the pan-genome analyses (individual species and the combined analysis) are not entirely consistent. For example, in the context of GO enrichment analysis, consider the following examples:
The merging of gene clusters results in a change of gene names in the combined file compared to their unique names within individual species. For enrichment analysis, consistent gene naming is typically required, and clearly, the string pucF is not equivalent to pucF~~~pucF_2~~~pucF_1, as they represent a containment relationship.
This inconsistency has led me to a loss of confidence in the downstream enrichment analysis based on pan-genome results. To put it frankly, whether I opt to rename the genes, decompose pucF~~~pucF_2~~~pucF_1 into three separate entries based on the "~~~" delimiter, or de-duplicate to retain only a single line, it seems that any of these approaches would alter the distribution of GO terms, which is both disheartening and confusing.
I am keen to learn how others in the community address this issue and how one might approach the resolution of gene naming and enrichment analysis challenges.
Panaroo Command:
I look forward to your valuable feedback and thank you in advance for your assistance.
Best regards, luobosi