Hello,
I am currently reannotating many P. aeruginosa genomes, and I want to use the PAO1 annotations from the Pseudomonas Genome Database, together with a couple of other proteins, as the reference for the first round of annotation. However, when PAO1 itself is annotated, not all of the expected genes are there, and I am struggling to work out why.
In my reference file, Pa_PAO1_107_annotations.gbk, one gene has the following entry:
gene complement(2694546..2694764)
/gene="PA2412"
/locus_tag="PA2412"
/db_xref="Pseudomonas Genome DB: PGD107602"
CDS complement(2694546..2694764)
/gene="PA2412"
/locus_tag="PA2412"
/product="conserved hypothetical protein"
/codon_start=1
/translation_table=11
/translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK
DCLAYIEEVWTDMRPLSLRQHMDKAAG"
/protein_id="NP_251102.1"
After converting to a FASTA file with prokka-genbank_to_fasta_db, we have the following entry:
>NP_251102.1 ~~~PA2412~~~conserved hypothetical protein
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRP
LSLRQHMDKAAG
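To double-check what the converted header carries, it can be split into its fields. This is only a minimal sketch: it assumes the description follows an `EC_number~~~gene~~~product` layout (which is what the entry above appears to use, with an empty EC number), and `parse_prokka_header` is a hypothetical helper name, not part of Prokka.

```python
def parse_prokka_header(line):
    """Split a '>ID EC_number~~~gene~~~product' header into its parts.

    The field order is an assumption based on the entry shown above.
    """
    seq_id, _, desc = line.lstrip(">").strip().partition(" ")
    fields = desc.split("~~~")
    # Pad so that headers with fewer than three fields still unpack cleanly.
    fields += [""] * (3 - len(fields))
    ec_number, gene, product = (f.strip() for f in fields[:3])
    return {"id": seq_id, "ec_number": ec_number, "gene": gene, "product": product}


header = ">NP_251102.1 ~~~PA2412~~~conserved hypothetical protein"
parsed = parse_prokka_header(header)
print(parsed)
```

For this entry the gene field is populated, so the gene name is at least present in the database passed to --proteins.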
I then run Prokka with:
prokka --outdir ./Pa_PAO1_107/ --prefix Pa_PAO1_107 --proteins ../../raw_data/genomes/siderophore_annotations.db --force --locustag Pa_PAO1_107 --cpus 8 ../oriented_genomes/Pa_PAO1_107/Pa_PAO1_107_reoriented.fasta
In the output file, Pa_PAO1_107.gbk, I have no matches for PA2412. However, I do have the following entry:
CDS complement(2694064..2694282)
/locus_tag="Pa_PAO1_107_02485"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251102.1"
/note="conserved hypothetical protein"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK
KDCLAYIEEVWTDMRPLSLRQHMDKAAG"
You can see that the two amino acid sequences are identical:
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
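Because the two GenBank files wrap the /translation value at different columns, comparing by eye is error-prone. A small sketch of the comparison, using the wrapped values exactly as they appear in the two entries above (`join_translation` is a hypothetical helper name):

```python
def join_translation(wrapped):
    """Remove quotes, line breaks and spaces from a wrapped /translation value."""
    return "".join(wrapped.replace('"', "").split())


# Wrapped as in the reference file.
reference = """MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK
DCLAYIEEVWTDMRPLSLRQHMDKAAG"""

# Wrapped as in the Prokka output.
predicted = """MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK
KDCLAYIEEVWTDMRPLSLRQHMDKAAG"""

ref_seq = join_translation(reference)
out_seq = join_translation(predicted)
print("identical:", ref_seq == out_seq)
```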
I am unsure why, with identical amino acid sequences, this has not been annotated with /gene="PA2412". Clearly it has matched to some degree, as the inference is /inference="similar to AA sequence:siderophore_annotations.db:NP_251102.1".
For another protein it has worked as expected:
Reference entry:
gene complement(2693781..2694545)
/gene="PA2411"
/locus_tag="PA2411"
/db_xref="Pseudomonas Genome DB: PGD107600"
CDS complement(2693781..2694545)
/gene="PA2411"
/locus_tag="PA2411"
/product="probable thioesterase"
/codon_start=1
/translation_table=11
/translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGAR
MAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGF
FACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADF
LLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQR
EAEVLAVVECQVEAWRAGQGAAALAVESAAIC"
/protein_id="NP_251101.1"
Fasta entry:
>NP_251101.1 ~~~PA2411~~~probable thioesterase
MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGARMAEPLQTDLASLAQQ
LARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGFFACGTAAPSRRAEYDR
GFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADFLLCGSYRHQRRPPLACP
IRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQREAEVLAVVECQVEAWRAG
QGAAALAVESAAIC
Output .gbk entry:
CDS complement(2693299..2694063)
/gene="PA2411"
/locus_tag="Pa_PAO1_107_02484"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251101.1"
/codon_start=1
/transl_table=11
/product="putative thioesterase"
/translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGA
RMAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPL
GFFACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILR
ADFLLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFF
IHQREAEVLAVVECQVEAWRAGQGAAALAVESAAIC"
Why is it that for the second entry there is a gene field, but for the first there is not?
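To see how widespread this is, the output could be scanned for every CDS that matched my database, reporting whether /gene is present alongside /product and /note. The following is a rough sketch only: it uses a simple line-based parser rather than a real GenBank library, `parse_features` and `report` are hypothetical helper names, and the sample text is the two output entries above with /translation and a few other qualifiers omitted for brevity.

```python
import re

# A line that starts a feature: a key followed by a location.
FEATURE_START = re.compile(
    r"^\s*(\w+)\s+((?:complement|join)\(.*|<?\d+\.\.>?\d+)\s*$"
)
QUALIFIER = re.compile(r'/(\w+)=("[^"]*"|\S+)')


def parse_features(text):
    """Return a list of features with their type, location and qualifiers."""
    features = []
    current = None
    for line in text.splitlines():
        match = FEATURE_START.match(line)
        if match:
            current = {"type": match.group(1), "location": match.group(2), "raw": ""}
            features.append(current)
        elif current is not None:
            # Re-join wrapped qualifier values with a single space.
            current["raw"] += " " + line.strip()
    for feature in features:
        qualifiers = {}
        for key, value in QUALIFIER.findall(feature["raw"]):
            qualifiers.setdefault(key, []).append(value.strip('"'))
        feature["qualifiers"] = qualifiers
    return features


def report(text, db_name):
    """List (locus_tag, gene, product, notes) for CDS features matched to db_name."""
    rows = []
    for feature in parse_features(text):
        if feature["type"] != "CDS":
            continue
        q = feature["qualifiers"]
        if any(db_name in inference for inference in q.get("inference", [])):
            rows.append((
                q.get("locus_tag", ["?"])[0],
                q.get("gene", [None])[0],
                q.get("product", [""])[0],
                q.get("note", []),
            ))
    return rows


sample = '''
CDS complement(2694064..2694282)
/locus_tag="Pa_PAO1_107_02485"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251102.1"
/note="conserved hypothetical protein"
/product="hypothetical protein"
CDS complement(2693299..2694063)
/gene="PA2411"
/locus_tag="Pa_PAO1_107_02484"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251101.1"
/product="putative thioesterase"
'''

rows = report(sample, "siderophore_annotations.db")
for locus_tag, gene, product, notes in rows:
    print(locus_tag, gene, product, notes, sep="\t")
```

Run over the full Pa_PAO1_107.gbk, a listing like this would show whether the missing /gene qualifiers coincide with a particular kind of /product or /note value.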
Thanks