tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Genes missing /gene field in .gbk output #708

Open NonAggressiveHail opened 2 months ago

NonAggressiveHail commented 2 months ago

Hello,

I am currently reannotating many P. aeruginosa genomes, and I want to use the PAO1 annotations from the Pseudomonas Genome Database, together with a couple of other proteins, as the reference for the first round of annotation. However, when PAO1 itself is annotated, not all of the expected genes are there, and I am struggling to work out why.

In my reference file, Pa_PAO1_107_annotations.gbk, one gene has the following entry:

    gene complement(2694546..2694764)
        /gene="PA2412"
        /locus_tag="PA2412"
        /db_xref="Pseudomonas Genome DB: PGD107602"
    CDS complement(2694546..2694764)
        /gene="PA2412"
        /locus_tag="PA2412"
        /product="conserved hypothetical protein"
        /codon_start=1
        /translation_table=11
        /translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK
        DCLAYIEEVWTDMRPLSLRQHMDKAAG"
        /protein_id="NP_251102.1"

After converting to a FASTA file with prokka-genbank_to_fasta_db, we have the following entry:

    >NP_251102.1 ~~~PA2412~~~conserved hypothetical protein
    MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRP
    LSLRQHMDKAAG
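To confirm that the gene name is present in the converted header, the header can be split into its fields. The following is only a minimal sketch: it assumes the description after the sequence ID is `~~~`-delimited in the order EC number, gene, product (which is what the header above suggests), and `parse_prokka_header` is a hypothetical helper written for this check, not part of Prokka.

```python
def parse_prokka_header(header: str) -> dict:
    """Split a Prokka-style FASTA header into ID, EC number, gene and product.

    Assumed layout (not verified against the Prokka source):
        >ID EC_number~~~gene~~~product
    """
    header = header.lstrip(">").rstrip()
    # The sequence ID is everything up to the first space.
    seq_id, _, description = header.partition(" ")
    # The remaining description is split on the '~~~' delimiter.
    parts = description.split("~~~")
    # Pad with empty strings so short headers still unpack into three fields.
    parts += [""] * (3 - len(parts))
    ec_number, gene, product = (p.strip() for p in parts[:3])
    return {"id": seq_id, "ec_number": ec_number, "gene": gene, "product": product}


fields = parse_prokka_header(
    ">NP_251102.1 ~~~PA2412~~~conserved hypothetical protein"
)
print(fields)
```

For this entry the EC number field comes back empty, while the gene and product fields are populated, so the gene name does appear to be reaching Prokka through the `--proteins` file.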

I then run Prokka with:

    prokka --outdir ./Pa_PAO1_107/ --prefix Pa_PAO1_107 --proteins ../../raw_data/genomes/siderophore_annotations.db --force --locustag Pa_PAO1_107 --cpus 8 ../oriented_genomes/Pa_PAO1_107/Pa_PAO1_107_reoriented.fasta

In the output file, Pa_PAO1_107.gbk, I have no matches for PA2412; however, I do have the following entry:

    CDS complement(2694064..2694282)
        /locus_tag="Pa_PAO1_107_02485"
        /inference="ab initio prediction:Prodigal:002006"
        /inference="similar to AA sequence:siderophore_annotations.db:NP_251102.1"
        /note="conserved hypothetical protein"
        /codon_start=1
        /transl_table=11
        /product="hypothetical protein"
        /translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK
        KDCLAYIEEVWTDMRPLSLRQHMDKAAG"

You can see that the two amino acid sequences are identical:

    MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
    MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
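The same comparison can be made programmatically rather than by eye. This is a minimal sketch: the two strings are copied from the `/translation` qualifiers of the reference and output entries above, including the line-wrap breaks, and `normalise` is a hypothetical helper that strips that whitespace before comparing.

```python
def normalise(seq: str) -> str:
    """Remove all whitespace (spaces, newlines) left over from line wrapping."""
    return "".join(seq.split())


# /translation from the reference entry (Pa_PAO1_107_annotations.gbk)
reference = (
    "MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK "
    "DCLAYIEEVWTDMRPLSLRQHMDKAAG"
)
# /translation from the Prokka output entry (Pa_PAO1_107.gbk)
predicted = (
    "MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK "
    "KDCLAYIEEVWTDMRPLSLRQHMDKAAG"
)

same = normalise(reference) == normalise(predicted)
print("identical:", same, "length:", len(normalise(reference)))
```

The two translations wrap at different positions, so a raw string comparison would report a difference; after removing whitespace they compare equal.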

I am unsure why, with identical amino acid sequences, this has not been annotated with /gene="PA2412". Clearly it has matched to some degree, as the inference is /inference="similar to AA sequence:siderophore_annotations.db:NP_251102.1".

For another protein it has worked as expected.

Reference entry:

    gene complement(2693781..2694545)
        /gene="PA2411"
        /locus_tag="PA2411"
        /db_xref="Pseudomonas Genome DB: PGD107600"
    CDS complement(2693781..2694545)
        /gene="PA2411"
        /locus_tag="PA2411"
        /product="probable thioesterase"
        /codon_start=1
        /translation_table=11
        /translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGAR
        MAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGF
        FACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADF
        LLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQR
        EAEVLAVVECQVEAWRAGQGAAALAVESAAIC"
        /protein_id="NP_251101.1"

FASTA entry:

    >NP_251101.1 ~~~PA2411~~~probable thioesterase
    MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGARMAEPLQTDLASLAQQ
    LARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGFFACGTAAPSRRAEYDR
    GFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADFLLCGSYRHQRRPPLACP
    IRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQREAEVLAVVECQVEAWRAG
    QGAAALAVESAAIC

Output .gbk entry:

    CDS complement(2693299..2694063)
        /gene="PA2411"
        /locus_tag="Pa_PAO1_107_02484"
        /inference="ab initio prediction:Prodigal:002006"
        /inference="similar to AA sequence:siderophore_annotations.db:NP_251101.1"
        /codon_start=1
        /transl_table=11
        /product="putative thioesterase"
        /translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGA
        RMAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPL
        GFFACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILR
        ADFLLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFF
        IHQREAEVLAVVECQVEAWRAGQGAAALAVESAAIC"

Why does the second entry get a /gene field in the output, while the first does not?

Thanks