Merge pangenome graphs - GitHub Issues

genomesandMGEs commented 2 years ago

Hi there,

Is it possible to merge pangenome graphs from independent runs? I know panaroo has that option, and I would like to know whether the same can be done with ppanggolin. If not, could you please suggest alternatives for comparing the pangenomes of independent runs?

Thanks!

axbazin commented 2 years ago

Hi,

What are you trying to achieve through this comparison, exactly? Is it for example to compare the gene families and their partitions between both pangenomes, and know which family is persistent in both pangenomes, which is shell in one and persistent in the other, things like that?

Adelme

genomesandMGEs commented 2 years ago

Hey,

Thanks for the (super) quick reply! Exactly, that's what I was thinking about.

axbazin commented 2 years ago

We do not have anything that directly implements a straightforward comparison between two pangenomes (for now). However, you can get that with some file comparisons. Assuming you have the latest version installed, you can do the following:

First, get all family sequences for both pangenomes:

ppanggolin fasta --prot_families all -p pangenome_1.h5 -o prot_pangenome_1 
ppanggolin fasta --prot_families all -p pangenome_2.h5 -o prot_pangenome_2

Those commands will write a file 'all_protein_families.faa' in the output directory provided with -o. Then, you can compare this file to the other pangenome:

ppanggolin align -p pangenome_1.h5 --proteins prot_pangenome_2/all_protein_families.faa -o align_prot_pang2_to_pang1
ppanggolin align -p pangenome_2.h5 --proteins prot_pangenome_1/all_protein_families.faa -o align_prot_pang1_to_pang2

You can provide --identity (default is 0.5) and --coverage (default is 0.8) thresholds for the comparison. In each of your output directories, 'align_prot_pang2_to_pang1' and 'align_prot_pang1_to_pang2', you will get two files.

The first one, 'proteins_partition_projection.tsv', is tab-separated. Its first column indicates a family id from the faa file, and its second column indicates the partition of the most similar family in the pangenome it was compared to.
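To make the two-column layout concrete, here is a minimal Python sketch that loads such a file and counts how many query families were projected onto each partition. It assumes the file has no header line and exactly the two columns described above; the function names are illustrative and are not part of ppanggolin.

```python
import csv
from collections import Counter


def load_projection(path):
    """Map each query family id (column 1) to its projected partition (column 2).

    Assumption: tab-separated, no header line.
    """
    projection = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            projection[row[0]] = row[1]
    return projection


def partition_counts(projection):
    """Count how many query families landed in each partition."""
    return Counter(projection.values())
```

For example, `partition_counts(load_projection("align_prot_pang2_to_pang1/proteins_partition_projection.tsv"))` would summarize how the families of pangenome 2 are distributed over the partitions of pangenome 1.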

Alternatively, the 'input_to_pangenome_associations.blast-tab' file is an alignment file with blast-like results from the proteins vs pangenome alignment, which will give you family ids from both pangenomes directly (there can be multiple hits).
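Since there can be multiple hits per query, one way to reduce the alignment file to a one-to-one mapping is to keep the best hit for each query. The sketch below assumes the file follows the common 12-column BLAST tabular layout (query id in column 1, target id in column 2, bit score in column 12); that layout is an assumption and should be checked against your own output before relying on it. The function name is illustrative.

```python
import csv


def best_hits(path):
    """Return {query_id: (target_id, bitscore)}, keeping the highest bit score.

    Assumption: 12-column BLAST-style tabular file, bit score in the last
    of the 12 columns.
    """
    best = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 12 or row[0].startswith("#"):
                continue  # skip blank, short, or comment lines
            query, target, bitscore = row[0], row[1], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (target, bitscore)
    return best
```

Keeping only the top hit is a simplification: families with several near-equal hits may deserve a closer look.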

By comparing those files with the partitions of the families in their pangenome of origin, you should be able to get what you want, I believe? If you have any questions or need me to clarify something, do not hesitate!

Adelme

genomesandMGEs commented 2 years ago

Hey,

Thanks for the detailed explanation.

So, if I understood correctly, this approach will give you information about the family ids from pangenome 1 that match families in pangenome 2, right? But the classification in the 2nd column only lets you know that a given id is considered 'persistent' in pangenome 2, and it may not be so in pangenome 1?

Also, family ids not listed in column 1 of 'proteins_partition_projection.tsv' will represent families specific to pangenome 1, i.e. families which have no match in pangenome 2?

axbazin commented 2 years ago

Yes, absolutely, you are correct on all of your points.
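The families without a match can be listed with a simple set difference between the ids in the faa file and the ids in column 1 of the projection file. This is a minimal sketch, assuming the family id is the first whitespace-delimited token of each FASTA header and that the projection file has no header line; the function names are illustrative.

```python
def fasta_ids(path):
    """Collect sequence ids from a FASTA file (first token of each '>' line)."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def projected_ids(path):
    """Collect the ids found in column 1 of the projection file."""
    with open(path) as handle:
        return {line.split("\t")[0].strip() for line in handle if line.strip()}


def unmatched_families(faa_path, projection_path):
    """Families present in the faa file but absent from the projection."""
    return fasta_ids(faa_path) - projected_ids(projection_path)
```

For example, `unmatched_families("prot_pangenome_1/all_protein_families.faa", "align_prot_pang1_to_pang2/proteins_partition_projection.tsv")` would list the pangenome 1 families with no match in pangenome 2.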

If you want, you can play with the filters available in ppanggolin fasta, which can make your comparison simpler. For example, you can run:

ppanggolin fasta --prot_families persistent -p pangenome_1.h5 -o prot_pangenome_1

to write only the persistent gene families (in a file called 'persistent_protein_families.faa'). You can do this with any partition; the filename will change accordingly.
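These per-partition files also give a way to recover the partition of each family in its pangenome of origin, which can then be cross-tabulated against the projected partition. Below is a rough Python sketch of that idea. It assumes the family id is the first token of each FASTA header, and the file names other than 'persistent_protein_families.faa' are guesses at how the name changes per partition. `projection` is a plain dict of family id to projected partition, as read from 'proteins_partition_projection.tsv'. All function names are illustrative.

```python
from collections import Counter


def origin_partitions(faa_by_partition):
    """Build {family_id: partition} from per-partition faa files.

    faa_by_partition maps a partition name to the path of its faa file,
    e.g. {"persistent": "prot_pangenome_1/persistent_protein_families.faa"}.
    """
    origin = {}
    for partition, path in faa_by_partition.items():
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    tokens = line[1:].split()
                    if tokens:
                        origin[tokens[0]] = partition
    return origin


def cross_tabulate(origin, projection):
    """Count (origin partition, projected partition) pairs.

    Families absent from the projection are counted under 'no_match'.
    """
    return Counter(
        (partition, projection.get(family, "no_match"))
        for family, partition in origin.items()
    )
```

A pair such as `("shell", "persistent")` in the result would then count the families that are shell in one pangenome and persistent in the other.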

Adelme

labgem / PPanGGOLiN

Merge pangenome graphs #68