gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

How to find paralog gene sequence in pan_genome_reference.fa? #304

Closed mai-rui-gao closed 1 month ago

mai-rui-gao commented 2 months ago

Dear Panaroo developers and community,

In the output documentation, it is noted that "to avoid issues with the multi-mapping of reads, paralogous gene clusters will only be represented once in this reference." I wonder if it is possible to obtain the sequences of the paralogous genes elsewhere?

In my case, I have 13,794 genes in "pan_genome_reference.fa" and 14,530 genes in "gene_presence_absence.Rtab." I am particularly interested in understanding the functions of the paralogous genes.

Any help or guidance on how to retrieve these paralog sequences would be greatly appreciated.

Thank you!

gtonkinhill commented 2 months ago

Hi,

The easiest way to investigate this is by examining the final_graph.gml file. This file includes a flag indicating whether a gene is a paralog. The clusterID can then be used to identify paralogous genes from the same family.

You can load this file in Cytoscape or analyse it programmatically using the NetworkX library in Python.
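
As a rough illustration of the programmatic route, here is a minimal sketch that reads the graph with NetworkX and groups the flagged nodes by family. The attribute names `paralog` and `clusterID` are assumptions based on the description above, and the helper names are hypothetical; print the attribute keys of one node in your own `final_graph.gml` to confirm what they are actually called before relying on this.

```python
from collections import defaultdict


def is_flagged(value):
    """Treat 1, "1", True and "True" as a set flag (GML may store either form)."""
    return str(value).strip().lower() in {"1", "true"}


def group_paralogs(nodes, flag_key="paralog", family_key="clusterID"):
    """Group paralog-flagged nodes by family.

    nodes      : iterable of (node_id, attribute_dict) pairs
    flag_key   : ASSUMED name of the paralog flag attribute
    family_key : ASSUMED name of the attribute identifying the gene family
    """
    families = defaultdict(list)
    for node_id, attrs in nodes:
        if is_flagged(attrs.get(flag_key, 0)):
            # If the family attribute is missing, the node forms its own group.
            families[attrs.get(family_key, node_id)].append(node_id)
    return dict(families)


def load_paralog_families(gml_path, **keys):
    """Read a GML graph and return {family: [node_id, ...]} for flagged nodes."""
    import networkx as nx  # third-party: pip install networkx

    graph = nx.read_gml(gml_path)
    return group_paralogs(graph.nodes(data=True), **keys)
```

Calling `load_paralog_families("final_graph.gml")` would then return a dictionary mapping each family identifier to the graph nodes flagged as paralogs. If your file uses different attribute names, pass them as `flag_key=` and `family_key=`.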

mai-rui-gao commented 2 months ago

Thank you for your reply!