oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

IndexError in detect pseudogenes when processing metagenome-assembled genome #133

Closed samnooij closed 2 years ago

samnooij commented 2 years ago

I have been running Bakta on >200 genome sequences from bacterial isolates and metagenome-assembled genomes. On all but one this has worked flawlessly. On that one genome, I get an IndexError (see below). This appears to happen in the detect pseudogenes stage. My guess is that there is an alignment of length 0, so that any index is out of range. Would that be possible? As a workaround, I have now run the same command with the --skip-pseudo flag, which works without problems.

Traceback (most recent call last):
  File "/path/to/my_project/.snakemake/conda/d0a07bf5/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/path/to/my_project/.snakemake/conda/d0a07bf5/lib/python3.10/site-packages/bakta/main.py", line 286, in main
    pseudogenes = feat_cds.detect_pseudogenes(pseudo_candidates, cdss, genome) if len(pseudo_candidates) > 0 else []
  File "/path/to/my_project/.snakemake/conda/d0a07bf5/lib/python3.10/site-packages/bakta/features/cds.py", line 643, in detect_pseudogenes
    observations, positions = detect_pseudogenization_observations(
  File "/path/to/my_project/.snakemake/conda/d0a07bf5/lib/python3.10/site-packages/bakta/features/cds.py", line 798, in detect_pseudogenization_observations
    compare_alignments(observations, alignment, ref_alignment, cds, positions, elongated_edge)
  File "/path/to/my_project/.snakemake/conda/d0a07bf5/lib/python3.10/site-packages/bakta/features/cds.py", line 815, in compare_alignments
    if alignment[up_length-1] == 'M' and ref_alignment[up_length-1] != 'M':
IndexError: string index out of range
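
To illustrate the failing expression in isolation, here is a minimal sketch. It assumes the two alignments are plain Python strings of per-position codes such as 'M', as the traceback suggests. The function names `unguarded` and `edge_is_mismatch` are hypothetical and are not part of Bakta. The guard shown is only one possible defensive check, not the fix that Bakta applied.

```python
def unguarded(alignment: str, ref_alignment: str, up_length: int) -> bool:
    # Same expression as line 815 in the traceback; raises IndexError when
    # up_length - 1 falls outside either string (including empty strings).
    return alignment[up_length - 1] == 'M' and ref_alignment[up_length - 1] != 'M'


def edge_is_mismatch(alignment: str, ref_alignment: str, up_length: int) -> bool:
    # Hypothetical guarded variant: treat an out-of-range position as "no observation".
    idx = up_length - 1
    # idx == -1 (up_length == 0) would silently wrap around to the last
    # character of a non-empty string, so negative indices are rejected too.
    if idx < 0 or idx >= len(alignment) or idx >= len(ref_alignment):
        return False
    return alignment[idx] == 'M' and ref_alignment[idx] != 'M'


if __name__ == "__main__":
    for args in [("", "", 0), ("MM", "MD", 5)]:
        try:
            unguarded(*args)
        except IndexError as err:
            print(f"unguarded{args} -> IndexError: {err}")
    print(edge_is_mismatch("MM", "MD", 2))
```

Both an empty alignment and an `up_length` larger than the alignment reproduce the same "string index out of range" message, so the traceback alone does not show which of the two happened.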

This error occurs when I run bakta as:

bakta --db {params.db} -o {params.output_dir} --prefix {wildcards.sample} -t {threads} {input}

(Names in {curly braces} are Snakemake variables. As I said, this worked fine with all other genomes.)
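
For a batch of many genomes, the --skip-pseudo workaround could be applied only to the samples that fail. The sketch below is one way to do that from Python. It uses only the options that appear in this report (--db, -o, --prefix, -t, --skip-pseudo). The helper names and the `runner` parameter are hypothetical, and it assumes that bakta exits with a non-zero status when it crashes. Retrying on any failure can also mask unrelated errors, so the return value records whether the fallback was used.

```python
import subprocess
from typing import Callable, List


def build_bakta_cmd(db: str, output_dir: str, prefix: str, threads: int,
                    genome: str, skip_pseudo: bool = False) -> List[str]:
    # Assemble the same command line as in the report, as an argument list.
    cmd = ["bakta", "--db", db, "-o", output_dir,
           "--prefix", prefix, "-t", str(threads)]
    if skip_pseudo:
        cmd.append("--skip-pseudo")
    cmd.append(genome)
    return cmd


def run_bakta_with_fallback(db: str, output_dir: str, prefix: str, threads: int,
                            genome: str,
                            runner: Callable = subprocess.run) -> bool:
    # Returns True if the run only succeeded with --skip-pseudo.
    try:
        runner(build_bakta_cmd(db, output_dir, prefix, threads, genome), check=True)
        return False
    except subprocess.CalledProcessError:
        # Second attempt without pseudogene detection; a failure here propagates.
        runner(build_bakta_cmd(db, output_dir, prefix, threads, genome,
                               skip_pseudo=True), check=True)
        return True
```

Samples for which the function returns True can then be re-annotated with pseudogene detection enabled once a fixed release is available.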

I installed bakta using conda, using this YAML file:

name: bakta

channels:
- defaults
- bioconda
- conda-forge

dependencies:
- bakta=1.5.0

(With the command: mamba env create -f bakta.yaml.)

I have also tested that one genome with the --verbose flag. Please find the resulting log file enclosed. test_bakta_error-logfile.txt

oschwengers commented 2 years ago

Dear @samnooij, thanks for the heads-up and the very detailed report. We recognized this bug just yesterday; it occurs on only a few genomes. We've pinpointed it to cases in which short, very proximate CDS are encoded in the elongated 5'/3' regions of each other, and we have a patch for this (#130). We're working on it and plan to release the patch in v1.5.1 soon.