gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Panaroo merge does not produce correct pan_genome_reference.fa #145

Closed: bananabenana closed this issue 2 years ago

bananabenana commented 2 years ago

Hi, thanks for the great software. I am having a problem with the pan_genome_reference.fa output file being incorrect during panaroo merge.

I merged two pangenome runs with the following command:

panaroo-merge -d panaroo1 panaroo2 -o outdir

The run finished successfully and produced all the expected output files. However, the gene counts differ: pan_genome_reference.fa contains 51,298 genes, while gene_presence_absence.csv contains 51,861. That leaves 563 genes missing from the reference FASTA file. Is this behaviour expected?
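For anyone wanting to reproduce this comparison, here is a minimal Python sketch that lists which gene names are in the CSV but absent from the reference FASTA. It assumes that each FASTA header starts with the same name that appears in the "Gene" column of gene_presence_absence.csv; that matches what I see in my output, but treat it as an assumption. The function names are just illustrative.

```python
import csv


def fasta_ids(path):
    """Return the set of record names (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def csv_gene_names(path, column="Gene"):
    """Return the set of names in the given column of the presence/absence CSV.

    Assumption: the column is called "Gene", as in my gene_presence_absence.csv.
    """
    with open(path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def missing_from_reference(fasta_path, csv_path):
    """List gene names that are in the CSV but have no FASTA record."""
    return sorted(csv_gene_names(csv_path) - fasta_ids(fasta_path))


# Example call (paths follow the `-o outdir` used in the merge command):
# missing = missing_from_reference("outdir/pan_genome_reference.fa",
#                                  "outdir/gene_presence_absence.csv")
# print(len(missing), "genes have no reference sequence")
```

Comparing names rather than just counts shows exactly which clusters lack a reference sequence, which makes it easier to check whether they are paralogues.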

I repeated panaroo merge on different datasets and reproduced the same error.

For additional information, the original panaroo runs were as follows:

panaroo -i *.gff3 -o panaroo1 --clean-mode sensitive -a core --aligner mafft --no_clean_edges --core_threshold 0.98 --merge_paralogs --remove-invalid-genes

In the meantime, I guess these sequences can be extracted using the homologue locus tags from the gene_presence_absence.csv file and the original .gff3 files.
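A rough sketch of that workaround in Python, under several assumptions: the .gff3 files are Prokka-style with an embedded ##FASTA section, each CDS feature carries an ID=<locus_tag> attribute, and the tags passed in are plain locus tags (cells in the CSV may hold more than one tag, which would need splitting first). The helper names are hypothetical and nothing here is part of Panaroo itself.

```python
# Complement table used when a gene sits on the reverse strand.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def read_gff3(path):
    """Parse CDS coordinates and contig sequences from a GFF3 with ##FASTA."""
    features = {}  # locus tag -> (contig, start, end, strand)
    contigs = {}   # contig name -> nucleotide sequence
    in_fasta, name, chunks = False, None, []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith("##FASTA"):
                in_fasta = True
                continue
            if in_fasta:
                if line.startswith(">"):
                    if name is not None:
                        contigs[name] = "".join(chunks)
                    fields = line[1:].split()
                    name = fields[0] if fields else ""
                    chunks = []
                elif line:
                    chunks.append(line)
                continue
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) != 9 or cols[2] != "CDS":
                continue
            attrs = dict(item.split("=", 1)
                         for item in cols[8].split(";") if "=" in item)
            if "ID" in attrs:  # assumption: ID holds the locus tag
                features[attrs["ID"]] = (cols[0], int(cols[3]),
                                         int(cols[4]), cols[6])
    if name is not None:
        contigs[name] = "".join(chunks)
    return features, contigs


def extract_genes(gff3_path, locus_tags):
    """Return {locus_tag: nucleotide sequence} for the tags found in the file."""
    features, contigs = read_gff3(gff3_path)
    sequences = {}
    for tag in locus_tags:
        if tag not in features:
            continue
        contig, start, end, strand = features[tag]
        if contig not in contigs:
            continue
        seq = contigs[contig][start - 1:end]  # GFF3 is 1-based, inclusive
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
        sequences[tag] = seq
    return sequences
```

Tags that are not present in a given file are simply skipped, so the same set of tags can be passed to every .gff3 file in turn.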

Thanks

bananabenana commented 2 years ago

Okay, I saw an old issue regarding paralogues: https://github.com/gtonkinhill/panaroo/issues/137

However, could you please clarify why paralogues do not seem to be merged in gene_presence_absence.csv in the same way as in pan_genome_reference.fa when the --merge_paralogs option is used? Or am I misunderstanding the reason for the difference?

gtonkinhill commented 2 years ago

Hi,

Only a single copy of each paralog is given in the pan_genome_reference.fa file as you mentioned.

It is also necessary to use the --merge_paralogs option when merging graphs, or the merged graph will not attempt to combine paralogous genes located in different regions of the pangenome. I think this might be causing the discrepancy you observed.

bananabenana commented 2 years ago

Hi,

> It is also necessary to use the --merge_paralogs option when merging graphs, or the merged graph will not attempt to combine paralogous genes located in different regions of the pangenome. I think this might be causing the discrepancy you observed.

Okay that makes sense. Thank you for that tidbit! Re-running!