gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Interpreting ~~~ in gene names #127

Closed gavinmdouglas closed 3 years ago

gavinmdouglas commented 3 years ago

Hi there,

I'm not sure how to interpret lines in the gene presence/absence tables with gene names separated by "~~~".

For example:

fruB~~~fruB_3~~~fruB_1~~~fruB_2 fruB;fruB_3;fruB_1;fruB_2   Multiphosphoryl transfer protein

Does this denote paralogs?

Sorry if I missed the description in the documentation. I saw in the description of this file that gene annotations that have been merged are separated by a semicolon, which I think refers to fragmented genes that could have been misassembled in genomes (and is indicated by semicolons in the actual gene IDs per genome). I think this case I've highlighted is something different, correct?

Thanks,

Gavin

gtonkinhill commented 3 years ago

Hi Gavin,

The ~~~ delimiter separates the gene names (found in the corresponding GFF files) of any sequences included in that cluster. If the resulting 'name' is unique, it is kept; otherwise the cluster is assigned a 'group_#' label.

The 3rd column is the set of unique annotations found in the GFF for these sequences. These are often longer and more descriptive than the gene names and are separated by ;.
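
As a rough illustration of the two delimiters described above, here is a minimal Python sketch that splits a cluster's name field on ~~~ and its annotation field on ;. The function name `split_cluster_fields` is hypothetical and not part of Panaroo; the input strings are taken from the example row in this thread, and the sketch assumes neither delimiter occurs inside an individual name or annotation.

```python
def split_cluster_fields(gene_field, annotation_field):
    """Split one row's gene-name and annotation fields into lists.

    Hypothetical helper for illustration only, not a Panaroo function.
    gene_field:       first column, gene names joined by '~~~'
    annotation_field: third column, annotations joined by ';'
    """
    # Gene names of the sequences in the cluster, as found in the GFF files.
    names = [name for name in gene_field.split("~~~") if name]
    # Unique annotations for the same sequences; strip stray whitespace.
    annotations = [a.strip() for a in annotation_field.split(";") if a.strip()]
    return names, annotations


# Fields taken from the example row quoted earlier in this thread.
names, annotations = split_cluster_fields(
    "fruB~~~fruB_3~~~fruB_1~~~fruB_2",
    "Multiphosphoryl transfer protein",
)
print(names)
print(annotations)
```

A cluster that received a 'group_#' label instead of a joined name simply comes back as a single-element list of names.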

Unless you have enabled the --merge_paralogs option, each paralogous cluster will be given a separate row.

I will try to improve the documentation in the next release to make this clearer, as I agree it's a bit confusing at the moment.

gavinmdouglas commented 3 years ago

Hi @gtonkinhill,

Thanks for clarifying, that's much clearer.

All the best,

Gavin