labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

ppanggolin msa --partition core #231

Open lfdelzam opened 1 month ago

lfdelzam commented 1 month ago

Thank you for developing such a great program! I've been using ppanggolin msa and it's been helpful. Here is the command I've been working with:

ppanggolin msa -p Pangenome_graph/pangenome.h5 --partition core --source dna -o Pangenome_graph/MSA --phylo -c 20 -f

The documentation mentions: "By default it will write the strict 'core' (genes that are present in absolutely all genomes) and remove any duplicated genes."

I'm curious to learn more about how this workflow operates, especially regarding the removal of duplicated genes. Could you please provide more details on what exactly this entails? Specifically, does the program select one copy of the duplicated genes or does it only use single copy genes?

Thanks in advance

JeanMainguy commented 1 month ago

Hello,

The msa command begins by selecting gene families. When you use the --partition core argument, it selects only the families that are present in all genomes. For each selected family, the command then considers only the genes that are single-copy in their respective genomes. So, if a family contains multiple genes within a genome, the genes from that genome are excluded from the MSA.

I hope this explanation clarifies the behavior of the command.

lfdelzam commented 1 month ago

So, is it only the non-single-copy genes within the family that are removed, or the entire gene family?

JeanMainguy commented 1 month ago

It is the non-single-copy genes within the family that are removed.
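
To make the selection rule described in this thread concrete, here is a minimal Python sketch. It is not PPanGGOLiN's code and does not use its API: the function name select_msa_genes, the input layout (a dict mapping each family to a list of (genome, gene) pairs), and the toy identifiers are all assumptions made for illustration. It only shows the two steps as explained above: keep families present in every genome, then, within each kept family, drop the genes of any genome that carries more than one copy.

```python
from collections import defaultdict


def select_msa_genes(families, genomes):
    """Illustrative only; names and data layout are hypothetical.

    families: dict mapping family name -> list of (genome, gene_id) pairs
    genomes:  iterable of all genome names in the pangenome
    Returns a dict mapping family name -> {genome: gene_id} for the genes
    that would go into that family's alignment under the rule described.
    """
    all_genomes = set(genomes)
    selected = {}
    for family, genes in families.items():
        # Group the family's genes by the genome they belong to.
        by_genome = defaultdict(list)
        for genome, gene_id in genes:
            by_genome[genome].append(gene_id)

        # Step 1: strict core, the family must be present in all genomes.
        if not all_genomes <= set(by_genome):
            continue

        # Step 2: keep a gene only if it is the single copy in its genome.
        # Genomes with several copies contribute nothing; the family stays.
        selected[family] = {
            genome: ids[0] for genome, ids in by_genome.items() if len(ids) == 1
        }
    return selected


genomes = ["A", "B", "C"]
families = {
    "fam1": [("A", "A_1"), ("B", "B_1"), ("C", "C_1")],               # single copy everywhere
    "fam2": [("A", "A_2"), ("B", "B_2"), ("B", "B_3"), ("C", "C_2")],  # duplicated in B
    "fam3": [("A", "A_4"), ("B", "B_4")],                              # absent from C
}
result = select_msa_genes(families, genomes)
```

In this toy input, fam1 is kept whole, fam2 is kept but without the two genes from genome B, and fam3 is skipped because it is not present in all genomes.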