gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
269 stars 34 forks source link

Wrong information in the gene_presence_absence.Rtab file #234

Closed AlcaArctica closed 1 year ago

AlcaArctica commented 1 year ago

Dear developers, I have just tried out panaroo and found it easy to install, use and interpret. However, I have found some discrepancies with the information in the gene_presence_absence.Rtab file. This file shows:

grep "yibR" gene_presence_absence.Rtab 
yibR    1    1    1    1    1    1    1    1    1    11    1    1    1    1    1

Therefore I would assume yibR is present in all samples (Salmonella draft genomes). However, this is not the case:

grep "yibR" *.gff3
11-00264.gff3:contig_1    Prodigal    CDS    1632866    1633075    .    -    0ID=DJDJOF_08100;Name=Uncharacterized protein YibR;locus_tag=DJDJOF_08100;product=Uncharacterized protein YibR;Dbxref=RefSeq:WP_000065393.1,SO:0001217,UniParc:UPI00016A098D,UniRef:UniRef100_P40824,UniRef:UniRef50_P40824,UniRef:UniRef90_P40824;gene=yibR
11-00347.gff3:contig_1    Prodigal    CDS    4187992    4188201    .    +    0ID=BMCPGJ_20725;Name=Uncharacterized protein YibR;locus_tag=BMCPGJ_20725;product=Uncharacterized protein YibR;Dbxref=RefSeq:WP_000065393.1,SO:0001217,UniParc:UPI00016A098D,UniRef:UniRef100_P40824,UniRef:UniRef50_P40824,UniRef:UniRef90_P40824;gene=yibR

It is only present in two samples! This is also reflected in the gene_data.csv file:

grep "yibR" gene_data.csv 
11-00264,contig_1,1_0_1534,DJDJOF_08100,MSNIKSYISNQKNIIQHDDFFGRRLDIALCFDHGFIMPAGVAIYSIIENNKDIDYALKQKERCQVKITL,ATGTCTAACATTAAAAGTTATATTTCAAACCAAAAAAACATAATTCAACATGATGATTTTTTTGGAAGAAGACTGGACATTGCTCTTTGCTTTGATCATGGTTTTATAATGCCAGCAGGTGTGGCAATATATTCCATTATCGAAAATAATAAAGATATTGACTACGCCTTAAAGCAAAAAGAGAGATGTCAGGTAAAAATTACTCTATAG,yibR,Uncharacterized protein YibR
11-00347,contig_1,2_0_3915,BMCPGJ_20725,MSNIKSYISNQKNIIQHDDFFGRRLDIALCFDHGFIMPAGVAIYSIIENNKDIDYALKQKERCQVKITL,ATGTCTAACATTAAAAGTTATATTTCAAACCAAAAAAACATAATTCAACATGATGATTTTTTTGGAAGAAGACTGGACATTGCTCTTTGCTTTGATCATGGTTTTATAATGCCAGCAGGTGTGGCAATATATTCCATTATCGAAAATAATAAAGATATTGACTACGCCTTAAAGCAAAAAGAGAGATGTCAGGTAAAAATTACTCTATAG,yibR,Uncharacterized protein YibR

I have observed this behaviour several times now. What is going on? Thank you for your support!

thorellk commented 1 year ago

Hi AlcaArctica,

I am not one of the developers but have a question for you: If you look up which CDSs that are in the yibR gene cluster (you can find this information e.g. in the gene_presence_absence_roary.csv file) are all of the genes in that cluster annotated as yibR? It could be that they are highly homologous and that all of them actually are yibR but that they just haven't gotten that name in the prokka annotation, this happens from time to time and is mentioned under Quick Start>Running Prokka in the panaroo docs:

"Panaroo is designed to correct for many sources of error introduced during annotation. This includes refinding genes in cases were there are inconsistencies in the annotation of the same genetic sequence in different genomes. It is also possible to force Prokka to use the same gene model for each genome. This will improve the consistency of annotations but any biases present in the model will be present in all genome annotations."

I don't know if this is the problem in your case but I have observed this in my own analyses so it came to my mind when I read your post.

BW, Kaisa

nzmacalasdair commented 1 year ago

It seems (as Kaisa has said) that this is a result of genes belonging to the yibR pangenome gene cluster not being annotated as such in most of the initial annotations input into panaroo. This is mostly likely a result of the pre-panaroo annotation pipeline struggling to identify the yibR gene in some cases, but you should probably examine the yibR cluster, and the sequences within it, to make sure that the gene cluster has been correctly annotated as yibR in the cases where it has been, and also that cases where isolates' genes belonging to this cluster have not been annotated as yibR are not the result of, for example, premature stop codon mutations, which may affect the biological interpretation of your results.

The best way to check this is to examine the yibR cluster in the final pangenome graph, documentation here. From this graph, you should be able to identify all of the genes present within the yibR cluster by their panaroo gene id, such as 1_0_1534 and 2_0_3915. Using all of these ids will allow you to identify the original annotations, sequences, and ids using column 3 of the gene_data.csv output file.

That can be a little time-consuming, particularly if you are not familiar with cytoscape and/or do not have it installed, so a quicker option is to lookup the yibR cluster in the gene_presence_absence.csv file (which contains a little more information than the .Rtab file) and finding each isolate's original annotation ids in its respective column. From these annotation ids, you should be able to extract the original annotation entry from the isolate's .gff.

Hopefully this helps clear things up! Let us know if you run into any other problems.