oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Failed bakta run for some genomes #265

Closed ZarulHanifah closed 9 months ago

ZarulHanifah commented 10 months ago

I am running Bakta on a bunch of genomes. Many worked wonderfully, but a few failed due to something CRISPR-related. One of the genomes is GCA_025196405.1. Here is the error message:

Traceback (most recent call last):
  File "/home/mzar0002/miniconda3/envs/bakta/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/home/mzar0002/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File "/home/mzar0002/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/features/crispr.py", line 121, in predict_crispr
    assert len(crispr_array['repeats']) == int(copies), print(f"len(reps)={len(crispr_array['repeats'])}, int(copies)={int(copies)}")
AssertionError: None

The commands (part of a snakemake workflow):

bakta --db {input.db} \
            --output $outdir \
            --prefix $prefix \
            --locus-tag $locustag \
            --threads {threads} \
            --force --debug \
            {input.genome} 2> {log}

The log:

[after so many lines]
...
05:02:20.536 - INFO - NC_RNA_REGION - contig=contig_27, start=170467, stop=170494, strand=-, label=L19-Flavobacteria, product=L19-Flavobacteria ribosomal protein leader, length=28, truncated=None, score=39.9, evalue=8.3e-07
05:02:20.536 - INFO - NC_RNA_REGION - contig=contig_30, start=161750, stop=161875, strand=+, label=FMN, product=FMN riboswitch (RFN element), length=126, truncated=None, score=120.0, evalue=3.0e-20
05:02:20.537 - INFO - NC_RNA_REGION - contig=contig_31, start=45483, stop=45584, strand=-, label=SAM, product=SAM riboswitch (S box leader), length=102, truncated=None, score=75.1, evalue=3.8e-13
05:02:20.537 - INFO - NC_RNA_REGION - contig=contig_31, start=88433, stop=88529, strand=-, label=SAM, product=SAM riboswitch (S box leader), length=97, truncated=None, score=70.9, evalue=2.9e-12
05:02:20.537 - INFO - NC_RNA_REGION - contig=contig_31, start=179828, stop=179923, strand=-, label=SAM, product=SAM riboswitch (S box leader), length=96, truncated=None, score=65.5, evalue=4.0e-11
05:02:20.537 - INFO - NC_RNA_REGION - predicted=17
05:02:20.537 - DEBUG - MAIN - start CRISPR prediction
05:02:20.537 - DEBUG - CRISPR - cmd=['pilercr', '-in', '/tmp/tmp_z4vcynq/contigs.fna', '-out', '/tmp/tmp_z4vcynq/crispr.txt', '-noinfo', '-quiet']
05:02:21.448 - INFO - CRISPR - contig=contig_6, start=3, stop=822, spacer-length=30, repeat-length=47, # repeats=11, repeat-consensus=GTTGTGTTATATCACAAAGATATCCAAAATTGAAAGCAATTCACAAC, nt=[GTTGTGTTAT..AATTCACAAC]

I installed bakta through conda.

Thank you.

marade commented 10 months ago

I also got this error with a run on Bakta v1.9.1.

predict CRISPR arrays...
len(reps)=5, int(copies)=6
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/py310/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bakta/features/crispr.py", line 121, in predict_crispr
    assert len(crispr_array['repeats']) == int(copies), print(f"len(reps)={len(crispr_array['repeats'])}, int(copies)={int(copies)}")
AssertionError: None
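The `AssertionError: None` in both tracebacks looks like a side effect of how the assertion is written: the message expression is a `print(...)` call, which writes the counts to standard output and returns `None`, so the exception itself carries no message. A minimal sketch of that behaviour, using the counts reported above and hypothetical helper names (`check_counts`, `check_counts_with_message`) rather than Bakta's actual code:

```python
import contextlib
import io


def check_counts(n_repeats: int, copies: int) -> None:
    """Mimic the pattern from the traceback: print() used as the assert message."""
    # print() runs only when the condition is false; it returns None,
    # so the AssertionError is raised with None as its argument.
    assert n_repeats == copies, print(f"len(reps)={n_repeats}, int(copies)={copies}")


def check_counts_with_message(n_repeats: int, copies: int) -> None:
    """Same check, but the f-string itself is the assertion message."""
    assert n_repeats == copies, f"len(reps)={n_repeats}, int(copies)={copies}"


# Capture stdout to show where the counts end up in the first variant.
captured = io.StringIO()
try:
    with contextlib.redirect_stdout(captured):
        check_counts(5, 6)
except AssertionError as err:
    printed_error = str(err)

try:
    check_counts_with_message(5, 6)
except AssertionError as err:
    message_error = str(err)
```

In the first variant the counts only appear on standard output, which is why a log that captures stderr alone (as in the original `2> {log}` redirect) would show the bare `AssertionError: None`.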

oschwengers commented 10 months ago

Hi @ZarulHanifah / @marade, thanks for reporting. Could you provide a genome sequence so I can reproduce and debug this error? I'd like to take a deeper look into this.

ZarulHanifah commented 10 months ago

Thank you @oschwengers . Here you go. GCA_025196405.1_ASM2519640v1.fasta.txt

wsowens commented 10 months ago

Thanks for your work on this project! Just commenting to say that I am experiencing this issue as well, with a similar backtrace. Happy to provide more example genomes if that would be helpful.

Edit: rerunning bakta with the --skip-crispr flag circumvents this issue.

oschwengers commented 10 months ago

@ZarulHanifah & @marade, I've merged PR #267, which fixes this. I wrongly supposed that there is always an even number of spacers & repeats in each CRISPR array. I fixed this and improved the PILER-CR parser. You can already use the fix from https://github.com/oschwengers/bakta/tree/main or wait until I've released patch v1.9.2, maybe sometime this week.
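One way to picture the assumption that broke: if spacers sit only between repeats, an array with n repeats has n-1 spacers, so a parser cannot expect the two counts to line up. The following is a rough sketch using a simplified, hypothetical row format; it is not PILER-CR's real output layout and not the code from PR #267.

```python
from typing import List, Optional, Tuple

# Hypothetical, simplified rows: (repeat_start, repeat_length, spacer_length or None).
# In this toy format the last repeat of an array has no trailing spacer.
Row = Tuple[int, int, Optional[int]]
Span = Tuple[int, int]


def split_array(rows: List[Row]) -> Tuple[List[Span], List[Span]]:
    """Return (repeats, spacers) as lists of 1-based inclusive (start, stop) pairs."""
    repeats: List[Span] = []
    spacers: List[Span] = []
    for start, repeat_len, spacer_len in rows:
        repeat_stop = start + repeat_len - 1
        repeats.append((start, repeat_stop))
        if spacer_len is not None:
            # A spacer begins right after its preceding repeat.
            spacers.append((repeat_stop + 1, repeat_stop + spacer_len))
    return repeats, spacers


# Three repeats of length 47 separated by two spacers of length 30.
rows = [(3, 47, 30), (80, 47, 30), (157, 47, None)]
repeats, spacers = split_array(rows)
```

Collecting repeats and spacers into separate lists, without asserting any fixed relationship between their lengths, keeps the parser from failing on arrays with an unexpected layout.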

ZarulHanifah commented 8 months ago

Thank you @oschwengers. Unfortunately, I got another AssertionError, this time in the check of the PILER-CR spacer sequences:

Bakta v1.9.2
Options and arguments:
        input: /fs03/jm41/Zarul/C002_B2_results/derep/dereplicated_genomes/metabat.641.fasta
        db: /fs03/ie79/db/bakta_db, version 5.0, full
        output: /fs03/jm41/Zarul/C002_B2_results/bakta/metabat.641
        force: True
        tmp directory: /tmp/tmpbg0fpfp9
        prefix: metabat.641
        threads: 2
        debug: True
        translation table: 11
        locus tag prefix: METABAT.641

Bakta runs in DEBUG mode! Temporary data will not be destroyed at: /tmp/tmpbg0fpfp9

parse genome sequences...
        imported: 388
        filtered & revised: 388
        contigs: 388

start annotation...
predict tRNAs...
        found: 112
predict tmRNAs...
        found: 1
predict rRNAs...
        found: 0
predict ncRNAs...
        found: 2
predict ncRNA regions...
        found: 13
predict CRISPR arrays...
Traceback (most recent call last):
  File "/fs03/ie79/Zarul/status_nanopore/C002_B2/.snakemake/conda/22185ec851ca2597fabecb499d58e23d_/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/fs03/ie79/Zarul/status_nanopore/C002_B2/.snakemake/conda/22185ec851ca2597fabecb499d58e23d_/lib/python3.10/site-packages/bakta/main.py", line 210, in main
    genome['features'][bc.FEATURE_CRISPR] = crispr.predict_crispr(genome, contigs_path)
  File "/fs03/ie79/Zarul/status_nanopore/C002_B2/.snakemake/conda/22185ec851ca2597fabecb499d58e23d_/lib/python3.10/site-packages/bakta/features/crispr.py", line 105, in predict_crispr
    assert spacer_seq == spacer_genome_seq  # assure PILER-CR provided sequence equals sequence extracted from genome
AssertionError

oschwengers commented 8 months ago

Hmm... OK, could you provide the metabat.641.fasta input file so I can debug this?

ZarulHanifah commented 8 months ago

Right, here you go! metabat.641.fasta.txt

Thank you!
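For context on the second failure: according to the comment on the failing line, the assertion compares the spacer sequence reported by PILER-CR with the sequence extracted from the genome. Below is a minimal sketch of such a check that reports a mismatch instead of aborting the run. The function name, the 1-based inclusive coordinates and the case-insensitive comparison are assumptions for illustration, not Bakta's implementation.

```python
from typing import Optional


def spacer_mismatch(contig_seq: str, start: int, stop: int, reported: str) -> Optional[str]:
    """Return a description of the mismatch, or None if the sequences agree.

    start/stop are assumed to be 1-based and inclusive.
    """
    extracted = contig_seq[start - 1:stop]
    if extracted.upper() == reported.upper():
        return None
    return f"spacer {start}..{stop}: reported={reported}, genome={extracted}"


contig = "AAACCCGGGTTT"
ok = spacer_mismatch(contig, 4, 6, "ccc")   # sequences agree
bad = spacer_mismatch(contig, 4, 6, "CCG")  # sequences differ
```

Returning the mismatch as a value lets the caller decide whether to log it, skip that array, or stop, which may be easier to diagnose than a bare `AssertionError`.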