gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
271 stars, 33 forks

Distinction between gene cluster names separated by "~~~" and "group_##". #254

Closed by GeorgiM26 1 year ago

GeorgiM26 commented 1 year ago

Hello there, I am having trouble discerning the difference between the labels for gene clusters in the gene_presence_absence.csv file.

Based on what I've read in previously resolved issues and the documentation, my current understanding is that if the gene names in the GFF files differ (e.g., galE and galE_1) but the sequences are assigned to the same cluster, the names are joined with the ~~~ delimiter.

In contrast, "group_##" naming labels a cluster as 'group' if the same gene name is duplicated in multiple different clusters. In this case, does this mean that the same gene name (e.g., rpoN) has been found as the sole unique name for two (or more) entirely different clusters (not paralogs; not sharing structural similarity)?

Apologies if I missed some critical information about this or if this question seems redundant, but I hope you can clarify how to distinguish the two.

Thank you for the awesome tool!

Georgi.

gtonkinhill commented 1 year ago

Hi Georgi,

You're correct in the description of the '~~~' separator. However, currently the "group_##" is also used for paralogs that are annotated with the same name (or no name). In a future version, I'm planning on introducing a paralog ID to make it easier to match up paralogous gene clusters. At the moment this information is stored in the final_graph.gml file.
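For anyone who wants to see how the two label types are distributed in their own output, here is a minimal sketch that sorts cluster labels into three buckets. It assumes the cluster label sits in a column named `Gene` (the Roary-style layout) and that group labels look like `group_` followed by digits; the function names and the example rows are made up for illustration and are not part of Panaroo. Note that, per the reply above, a `group_##` label alone does not tell you whether the cluster is an unnamed gene or a paralog that shares a name with another cluster.

```python
import csv
import io
import re
from collections import Counter

# Assumption: group labels are "group_" followed by one or more digits.
GROUP_RE = re.compile(r"^group_\d+$")


def classify_cluster_name(name):
    """Classify a cluster label as 'merged', 'group', or 'named'."""
    if "~~~" in name:
        # Several distinct annotated names were placed in one cluster.
        return "merged"
    if GROUP_RE.match(name):
        # Placeholder label: no name, or a name shared with another cluster.
        return "group"
    # A single annotated name used by exactly one cluster.
    return "named"


def summarise(csv_text, name_column="Gene"):
    """Count label types in CSV text; name_column is an assumed header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(classify_cluster_name(row[name_column]) for row in reader)


# Made-up rows, for illustration only.
example = """Gene,Non-unique Gene name,Annotation,sample_A,sample_B
galE~~~galE_1,,UDP-glucose 4-epimerase,A_001,B_017
group_12,,hypothetical protein,A_045,
rpoN,,RNA polymerase sigma-54 factor,A_102,B_099
"""

counts = summarise(example)
print(counts)
```

To run this on a real file, read its contents with `open(path, newline="").read()` and pass the text to `summarise`. Separating paralogs from unnamed genes among the `group_##` clusters would still require the information in final_graph.gml.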

GeorgiM26 commented 1 year ago

Alright! Thank you so much for the clarification. Georgi