89/7928 groups without sequence
grep -c '^>' pan_genome_reference.fa
7839
wc -l gene_presence_absence_roary.csv
7929 gene_presence_absence_roary.csv
wc -l gene_presence_absence.csv
7929 gene_presence_absence.csv
Hi,
This is poorly documented but intentional behaviour. We only return a single reference sequence for each centroid/paralog cluster. This was done to reduce duplications in the generated reference that can cause issues for read alignment.
I am hoping to improve the documentation soon. If there is a strong need for including duplicated sequences we could look at adding an option to include them.
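To see exactly which groups are behind the 89 missing entries, the two files can be compared directly. The following is only a minimal sketch: it assumes that the first column of gene_presence_absence.csv holds the group name and that each FASTA header in pan_genome_reference.fa starts with that same name. Neither assumption is confirmed in this thread, and the function names are made up for illustration.

```python
import csv


def fasta_ids(fasta_path):
    """Return the set of record IDs (first token after '>') in a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def groups_without_reference(csv_path, fasta_path):
    """List group names from the CSV that have no record in the FASTA file.

    Assumption: the first CSV column is the group name and matches the
    FASTA record ID exactly.
    """
    present = fasta_ids(fasta_path)
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        groups = [row[0] for row in reader if row]
    return [group for group in groups if group not in present]


# Example call (paths as used in this thread):
# missing = groups_without_reference("gene_presence_absence.csv",
#                                    "pan_genome_reference.fa")
# print(len(missing))
```

If the assumptions hold, the length of the returned list should equal the gap between the two counts above (7928 groups minus 7839 sequences).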
Hi,
Sorry, but I still do not understand why those groups have no sequence in the resulting pangenome FASTA file. I understand that you take only one sequence per group (the centroid), so for a singleton group (a gene present in only one copy in only one genome) that gene should still be added to the resulting pangenome. Am I right?
Or does that mean these groups are paralogs of other groups? In that case, how can I get this information?
Thanks for your help.
Sebastien
OK, I think I found this information in final_graph.gml (attribute paralog = 1, centroidID).
Yes, that's it. I will update the documentation to include this information.
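For anyone who wants to pull this out of final_graph.gml without opening Cytoscape, here is a rough sketch using only the Python standard library. It assumes each node is written as a `node [` ... `]` block with one `key value` pair per line, and that nodes carry `name`, `centroid` and `paralog` attributes (the column names visible in the Cytoscape table later in this thread). The helper names are hypothetical. A full GML parser such as the one in NetworkX would be more robust.

```python
import re

# One attribute per line: key followed by a quoted string or a bare token.
ATTR = re.compile(r'^(\w+)\s+(?:"(.*)"|(\S+))$')


def read_gml_nodes(path):
    """Collect the top-level attributes of every node block as dicts of strings."""
    nodes = []
    current = None  # attributes of the node being read, or None outside a node
    depth = 0       # bracket depth inside the current node block
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if current is None:
                if line == "node [":
                    current, depth = {}, 1
                continue
            if line.endswith("["):
                depth += 1  # nested block inside the node; its content is skipped
            elif line == "]":
                depth -= 1
                if depth == 0:
                    nodes.append(current)
                    current = None
            elif depth == 1:
                match = ATTR.match(line)
                if match:
                    quoted, bare = match.group(2), match.group(3)
                    current[match.group(1)] = quoted if quoted is not None else bare
    return nodes


def paralog_nodes(path):
    """Map group name -> centroid field for nodes flagged with paralog = 1."""
    return {
        node.get("name"): node.get("centroid")
        for node in read_gml_nodes(path)
        if node.get("paralog") == "1"
    }


# Example call:
# for name, centroid in paralog_nodes("final_graph.gml").items():
#     print(name, centroid)
```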
As a follow-up to this question regarding paralogs from the .gml file, I just want to make sure I understand the table output from Cytoscape. What I want to do is determine which group is paralogous to which group in the pangenome reference file. It looks like I can use the longCentroidID to tie a paralogous group to another group's centroid, correct?
Shortened the header a little so it would be easier to see on here:
centroid | description | geneIDs | label | longCentroidID | name | paralog | seqIDs | shared name |
---|---|---|---|---|---|---|---|---|
1_1_14 | DUF792 family protein | 1_4_15 | 5442 | 1_1_14 | group_2121 | 1 | 1_4_15 | group_2121 |
If I am reading this right, this row is for geneID 1_4_15, which belongs to group_2121 and is paralogous to centroid 1_1_14 in another group. What confuses me is that 'centroid' can contain multiple geneIDs but longCentroidID only has one. What exactly is longCentroidID? Also, what is the 'shared name' header? At first glance I would expect it to be the name of the group that this gene is paralogous to and that is actually in the pangenome reference file, but that is just me.
Not quite. I should really remove the `longCentroidID` from the final output as it is mainly used to help speed things up internally. The `centroid` field should allow you to match up paralogous genes. The reason it can have multiple entries is due to the family collapsing stage of the algorithm.
The `shared name` field should also be ignored for the moment and will probably be removed in a later release. I'm hoping to improve the documentation for these fields soon.
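One way to read the advice about the `centroid` field is to index groups by each centroid ID they list and then report groups that share an ID. The sketch below does that for rows given as dicts with `name` and `centroid` keys, for example rows from the exported Cytoscape node table. It rests on two assumptions that are not confirmed in this thread: that multiple centroid entries are separated by `;`, and that sharing a centroid ID is what marks two groups as paralogous. The function names are illustrative only.

```python
from collections import defaultdict


def groups_by_centroid(rows, sep=";"):
    """Map each centroid ID to the set of group names that list it.

    Assumption: multiple centroid entries in one field are joined by `sep`.
    """
    index = defaultdict(set)
    for row in rows:
        for centroid_id in row["centroid"].split(sep):
            centroid_id = centroid_id.strip()
            if centroid_id:
                index[centroid_id].add(row["name"])
    return index


def paralog_partners(rows, sep=";"):
    """Map each group name to the sorted names of groups sharing a centroid ID."""
    partners = defaultdict(set)
    for names in groups_by_centroid(rows, sep).values():
        for name in names:
            partners[name] |= names - {name}
    return {name: sorted(others) for name, others in partners.items() if others}


# Illustrative rows; only group_2121 and its centroid come from the table above,
# the other two groups are invented for the example.
rows = [
    {"name": "group_2121", "centroid": "1_1_14"},
    {"name": "group_17", "centroid": "1_1_14;2_3_5"},
    {"name": "group_99", "centroid": "9_9_9"},
]
print(paralog_partners(rows))
```

Rows of this shape can be read from a CSV export with `csv.DictReader`, provided the export keeps the `name` and `centroid` column headers.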
Dear PANAROO team,
I started using your tool (v1.1.2, installed through conda) a few days ago and I think I found a bug (or something I do not understand :) ).
For some groups, I cannot find a reference sequence in pan_genome_reference.fa. These groups contain only one gene, but other "singleton" groups do have a sequence in pan_genome_reference.fa, so I guess this is not the reason. These genes are present in the GFF files and in the gene_data.csv file.