oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0
432 stars 53 forks source link

CRISPR prediction assertion error #299

Closed ohickl closed 2 months ago

ohickl commented 2 months ago

Hi Oliver, I performed quite a few runs now on metagenomes of various sizes with the modified Bakta version from #289 and had barely any problems except for one sample and specifically with one contig. There, a PILER-CR predicted spacer seq doesnt match the seq in the fasta at the position parsed by Bakta.

Here the stdout from a run with regular Bakta and just the one contig:

Bakta v1.9.3
Options and arguments:
        input: .../bakta_devel/test/data/piler_test_contig.fna
        db: .../databases/imp3/bakta/db, version 5.1, full
        output: .../bakta_devel/test/sanity_check_piler_base
        force: True
        tmp directory: .../bakta_devel/test/tmp/tmp5fici64f
        prefix: annotation
        threads: 128
        debug: True
        meta mode: True
        translation table: 11
        keep contig headers: True

Bakta runs in DEBUG mode! Temporary data will not be destroyed at: .../bakta_devel/test/tmp/tmp5fici64f

parse genome sequences...
        imported: 1
        filtered & revised: 1
        contigs: 1

start annotation...
predict tRNAs...
        found: 0
predict tmRNAs...
        found: 0
predict rRNAs...
        found: 0
predict ncRNAs...
        found: 0
predict ncRNA regions...
        found: 0
predict CRISPR arrays...
array_id=1, contig_id=contig_0001, position=19605, repeat_length=47, repeat_seq=...........................T..................., spacer_seq=TTCCGGTG, spacer_length=8, gap_count=0, spacer_genome_seq=TGCGGACT
Traceback (most recent call last):
  File ".../bin/bakta", line 10, in <module>
      sys.exit(main())
  File ".../lib/python3.10/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File ".../lib/python3.10/site-packages/bakta/features/crispr.py", line 108, in predict_crispr
    assert spacer_seq == spacer_genome_seq  # assure PILER-CR provided sequence equals sequence extracted from genome
AssertionError

I do find the seq in the fasta at index -6 from crispr_spacer['start'], but am not sure how/if its relevant:

Start, Stop: 19652 19659
Spacer:  TGCGGAC
Spacer - 10:  TCCCTTCCGG
Spacer + 10:  TTCCGTTAAT
Spacer rev comp:  GTCCGCA

Im not sure, if the problem is PILER itself or the result parsing, but I get this with all Bakta versions.

I attached the fasta of the contig and a manual PILER run with the same params used in Bakta.

Let me know, if I should get in touch with the PILER people directly for this.

Best

Oskar piler_test_out.txt piler_test_contig.txt

oschwengers commented 2 months ago

Hi @ohickl , thank you very much for reporting! I just took a deeper look into this and found a wrong assumption that I made for the CRISPR parser of PILER-CR.

I wrongly assumed that all CRISPR spacers are of length N>=10. But, in your example, the CRISPR spacer is of length 8. Because of this, the regex didn't match to the result file. I pushed a PR with a fix. I'll run some tests and release a patch release, soon - hopefully by the end of this or next week.

Thanks again!