oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Incorrect locus tag incrementation #328

Closed NonAggressiveHail closed 1 month ago

NonAggressiveHail commented 1 month ago

When running with default options, locus tags increment in steps of 5 when, according to the documentation, they should increment in steps of 1.

Commands run:
fasta_name=Pa_DK1_substr_NH57388A_6643
bakta --output ./${fasta_name} \
  --prefix ${fasta_name} \
  --proteins ../../raw_data/genomes/Pa_PAO1_107_annotations_with_sip_aes.gbk \
  --force \
  --complete \
  --gram - \
  --keep-contig-headers \
  --locus-tag ${fasta_name##*_} \
  --threads 12 \
  ../../data/oriented_genomes/Pa_DK1_substr_NH57388A_6643/Pa_DK1_substr_NH57388A_6643_reoriented.fasta \
  --debug

Debug output:
Bakta v1.9.4
Options and arguments:
    input: /shared/home/jgh8/20230208_UKent/20240520_siderophore_prediction/data/oriented_genomes/Pa_DK1_substr_NH57388A_6643/Pa_DK1_substr_NH57388A_6643_reoriented.fasta
    db: /shared/home/jgh8/20230208_UKent/20240520_siderophore_prediction/programs/bakta/db, version 5.1, full
    user proteins: /shared/home/jgh8/20230208_UKent/20230307_ps_phylogenetics/raw_data/genomes/Pa_PAO1_107_annotations_with_sip_aes.gbk
    output: /shared/home/jgh8/20230208_UKent/20240520_siderophore_prediction/data/bakta_troubleshooting/Pa_DK1_substr_NH57388A_6643
    force: True
    tmp directory: /tmp/tmpgccs79c2
    prefix: Pa_DK1_substr_NH57388A_6643
    threads: 12
    debug: True
    translation table: 11
    gram: -
    locus tag prefix: 6643
    complete replicons: True
    keep contig headers: True

Bakta runs in DEBUG mode! Temporary data will not be destroyed at: /tmp/tmpgccs79c2

parse genome sequences...
    imported: 1
    filtered & revised: 1
    chromosomes: 1

start annotation...
predict tRNAs...
    found: 64
predict tmRNAs...
    found: 1
predict rRNAs...
    found: 12
predict ncRNAs...
    found: 49
predict ncRNA regions...
    found: 30
predict CRISPR arrays...
    found: 4
predict & annotate CDSs...
    predicted: 5645
    discarded spurious: 3
    revised translational exceptions: 1
    detected IPSs: 5500
    found PSCs: 123
    found PSCCs: 10
lookup annotations...
conduct expert systems...
    amrfinder: 8
    protein sequences: 605
    user protein sequences: 5240
    signal peptides: 673
combine annotations and mark hypotheticals...
detect pseudogenes...
    pseudogene candidates: 21
    found pseudogenes: 4
    analyze hypothetical proteins: 67
    detected Pfam hits: 1
    calculated proteins statistics
revise special cases...
extract sORF...
    potential: 35196
    discarded due to overlaps: 28753
    discarded spurious: 0
    detected IPSs: 1
    found PSCs: 0
lookup annotations...
filter and combine annotations...
    filtered sORFs: 1
    signal peptides: 0
detect gaps...
    found: 0
detect oriCs/oriVs...
    found: 1
detect oriTs...
    found: 0
apply feature overlap filters...
select features and create locus tags...
    selected: 5800
improve annotations...
    revised gene symbols: 105

genome statistics:
    Genome size: 6,212,531 bp
    Contigs/replicons: 1
    GC: 66.6 %
    N50: 6,212,531
    N ratio: 0.0 %
    coding density: 90.4 %

annotation summary:
    tRNAs: 63
    tmRNAs: 1
    rRNAs: 12
    ncRNAs: 49
    ncRNA regions: 30
    CRISPR arrays: 4
    CDSs: 5639
    hypotheticals: 66
    pseudogenes: 4
    signal peptides: 673
    sORFs: 1
    gaps: 0
    oriCs/oriVs: 1
    oriTs: 0

export annotation results to: /shared/home/jgh8/20230208_UKent/20240520_siderophore_prediction/data/bakta_troubleshooting/Pa_DK1_substr_NH57388A_6643
    human readable TSV...
    GFF3...
    INSDC GenBank & EMBL...
/shared/home/jgh8/miniconda3/envs/bakta/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py:727: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
  warnings.warn(
    genome sequences...
    feature nucleotide sequences...
    translated CDS sequences...
    circular genome plot...
    hypothetical TSV...
    translated hypothetical CDS sequences...
    machine readable JSON...
    genome and annotation summary...

If you use these results please cite Bakta: https://doi.org/10.1099/mgen.0.000685
Annotation successfully finished in 9:30 [mm:ss].

In the output TSV file, the first two loci are:
#Sequence Id              Type  Start  Stop  Strand  Locus Tag
NZ_LN870292.1|chromosome  cds   1      1545  +       6643_00005
NZ_LN870292.1|chromosome  cds   1574   2677  +       6643_00010
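For anyone wanting to confirm the step size across a whole annotation rather than from the first two rows, here is a minimal Python sketch. It assumes a tab-separated file whose header line starts with `#Sequence Id` and which has a `Locus Tag` column, as in the excerpt above, and that tags have the form `<prefix>_<number>`. The function name `locus_tag_steps` is made up for this example and is not part of Bakta.

```python
import csv
import io
from collections import Counter


def locus_tag_steps(tsv_text, tag_column="Locus Tag"):
    """Count the numeric step between consecutive locus tags in a TSV string."""
    # Keep the header line (starts with "#Sequence Id"); drop other comment lines.
    lines = [
        ln for ln in tsv_text.splitlines()
        if ln.strip() and (not ln.startswith("#") or ln.startswith("#Sequence Id"))
    ]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    numbers = []
    for row in reader:
        tag = (row.get(tag_column) or "").strip()
        # Assumption: tags look like "<prefix>_<zero-padded number>".
        if "_" in tag:
            numbers.append(int(tag.rsplit("_", 1)[1]))
    # Differences between neighbouring tag numbers, tallied by size.
    return Counter(b - a for a, b in zip(numbers, numbers[1:]))


# The two rows quoted in this report, written with real tab separators.
sample = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\n"
    "NZ_LN870292.1|chromosome\tcds\t1\t1545\t+\t6643_00005\n"
    "NZ_LN870292.1|chromosome\tcds\t1574\t2677\t+\t6643_00010\n"
)
print(locus_tag_steps(sample))  # Counter({5: 1})
```

To run it on a real output file, read the file contents into a string and pass that to `locus_tag_steps`; a single dominant key of 5 in the result corresponds to the behaviour described here.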

Clearly the increment is 5. This conflicts with the manual page, which says that the default increment should be 1. It also says that this can be changed:
--locus-tag LOCUS_TAG    Locus tag prefix (default = autogenerated)
--locus-tag-increment {1,5,10}    Locus tag increment: 1/5/10 (default = 1)
--keep-contig-headers    Keep original contig headers

However, this option is not present when bakta --help is run:
--locus-tag LOCUS_TAG    Locus tag prefix (default = autogenerated)
--keep-contig-headers    Keep original contig headers

And trying to run it regardless returns an error:
bakta: error: unrecognized arguments: --locus-tag-increment

Bakta was installed from conda

oschwengers commented 1 month ago

Hi @NonAggressiveHail, thanks for reporting. However, this is not a bug, just a discrepancy between the tagged version you're using (v1.9.4) and the active main branch, where I added the commit responsible for the change 2h ago ;-) https://github.com/oschwengers/bakta/commit/650eedc17e4814c15dad604487e8c88aab72fad4

An increment of 5 was the default up to v1.9.4 but will be changed with the upcoming v1.10.0.
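To illustrate what the increment means for the resulting tags, here is a small Python sketch. It is not Bakta's actual code: the function `make_locus_tags` and its parameters are hypothetical, and the five-digit zero padding is only inferred from the tags shown in this report (`6643_00005`, `6643_00010`).

```python
def make_locus_tags(prefix, n_features, increment=5, width=5):
    """Build illustrative locus tags of the form <prefix>_<zero-padded number>."""
    # The i-th feature (1-based) gets the number i * increment.
    return [f"{prefix}_{i * increment:0{width}d}" for i in range(1, n_features + 1)]


# Increment of 5, matching the v1.9.4 output quoted in this thread.
print(make_locus_tags("6643", 3, increment=5))
# ['6643_00005', '6643_00010', '6643_00015']

# Increment of 1, the default described in the main-branch documentation.
print(make_locus_tags("6643", 3, increment=1))
# ['6643_00001', '6643_00002', '6643_00003']
```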

If you need the docs for your version, then please have a look at the related release: https://github.com/oschwengers/bakta/tree/v1.9.4

Since this is not a bug, I'll close this for now. If you have any further questions, please do not hesitate to reach out or re-open the issue.