Closed AlcaArctica closed 1 year ago
Hi AlcaArctica,
I am not one of the developers but have a question for you: If you look up which CDSs that are in the yibR gene cluster (you can find this information e.g. in the gene_presence_absence_roary.csv file) are all of the genes in that cluster annotated as yibR? It could be that they are highly homologous and that all of them actually are yibR but that they just haven't gotten that name in the prokka annotation, this happens from time to time and is mentioned under Quick Start>Running Prokka in the panaroo docs:
"Panaroo is designed to correct for many sources of error introduced during annotation. This includes refinding genes in cases were there are inconsistencies in the annotation of the same genetic sequence in different genomes. It is also possible to force Prokka to use the same gene model for each genome. This will improve the consistency of annotations but any biases present in the model will be present in all genome annotations."
I don't know if this is the problem in your case but I have observed this in my own analyses so it came to my mind when I read your post.
BW, Kaisa
It seems (as Kaisa has said) that this is a result of genes belonging to the yibR
pangenome gene cluster not being annotated as such in most of the initial annotations input into panaroo. This is mostly likely a result of the pre-panaroo annotation pipeline struggling to identify the yibR
gene in some cases, but you should probably examine the yibR
cluster, and the sequences within it, to make sure that the gene cluster has been correctly annotated as yibR
in the cases where it has been, and also that cases where isolates' genes belonging to this cluster have not been annotated as yibR
are not the result of, for example, premature stop codon mutations, which may affect the biological interpretation of your results.
The best way to check this is to examine the yibR
cluster in the final pangenome graph, documentation here. From this graph, you should be able to identify all of the genes present within the yibR
cluster by their panaroo gene id, such as 1_0_1534
and 2_0_3915
. Using all of these ids will allow you to identify the original annotations, sequences, and ids using column 3 of the gene_data.csv
output file.
That can be a little time-consuming, particularly if you are not familiar with cytoscape and/or do not have it installed, so a quicker option is to lookup the yibR
cluster in the gene_presence_absence.csv
file (which contains a little more information than the .Rtab
file) and finding each isolate's original annotation ids in its respective column. From these annotation ids, you should be able to extract the original annotation entry from the isolate's .gff
.
Hopefully this helps clear things up! Let us know if you run into any other problems.
Dear developers, I have just tried out panaroo and found it easy to install, use and interpret. However, I have found some discrepancies with the information in the
gene_presence_absence.Rtab
file. This file shows:Therefore I would assume yibR is present in all samples (Salmonella draft genomes). However, this is not the case:
It is only present in two samples! This is also reflected in the
gene_data.csv
file:I have observed this behaviour several times now. What is going on? Thank you for your support!