rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

Rotary fails to repolish circular contigs due to workflow errors. #193

Closed LeeBergstrand closed 3 months ago

LeeBergstrand commented 3 months ago

Problem Description

When running rotary on specific FASTQ files, I get the following error:

WorkflowError in rule stitch_medaka in file /home/lee_bergstrand/rnd/rotary/rotary/rules/./polish.smk, line 127:
Failed to open input file: STR32-b/circularize/medaka/STR32-b_contigs.txt. Has it been deleted by another process? (rule stitch_medaka, line 221, /home/lee_bergstrand/rnd/rotary/rotary/rules/./polish.smk)

This rule occurs after the DAG is re-evaluated following the triggering of the checkpoint split_circular_and_linear_contigs

I noticed that the directories medaka and medaka_input do not exist under the circularize directory. The only directories under circularize are filter/lists with the linear and circular contig lists. No FASTA files were generated under circularize.

LeeBergstrand commented 3 months ago

The above error occurs when running snakemake 8.0 (develop_lee branch). When running snakemake 7.0 (develop branch), I get the following error message:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - combine_circular_and_linear_contigs: SO11/circularize/combine/SO11_combined.fasta
 - finalize_circular_contig_rotation: SO11/circularize/combine/SO11_circular.fasta
 - generate_contig_manifest: SO11/circularize/medaka/SO11_contigs.txt
 - prepare_medaka_circularize_input: SO11/circularize/medaka_input/SO11_input.fasta
 - all: 
 - run_circlator: SO11/circularize/circlator/SO11_rotated.fasta, SO11/circularize/circlator, SO11/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - get_polished_contigs: SO11/circularize/filter/SO11_circular.fasta
 - get_start_genes: SO11/circularize/identify/SO11_start_gene.ffn
 - search_contig_start: SO11/circularize/identify/SO11_circular.faa, SO11/circularize/identify/SO11_circular.ffn, SO11/circularize/identify/SO11_circular.gff, SO11/circularize/identify/SO11_hmmsearch_hits.txt, SO11/circularize/identify/SO11_hmmsearch_hits_no_comments.txt
 - process_start_genes: SO11/circularize/identify/SO11_start_genes.list
 - bypass_circularization: SO11/circularize/combine/SO11_linear.fasta
 - get_polished_contigs: SO11/circularize/filter/SO11_linear.fasta
 - stitch_medaka: SO11/circularize/medaka/SO11_consensus.fasta
 - circularize: checkpoints/circularize
 - calculate_final_long_read_coverage: SO11/annotation/coverage/SO11_long_read.bam, SO11/annotation/coverage/SO11_long_read.bam.bai, SO11/annotation/coverage/SO11_long_read_coverage.tsv
 - symlink_circularization: SO11/circularize/SO11_circularize.fasta
 - symlink_logs: SO11/annotation/logs, SO11/annotation/stats
 - annotation: checkpoints/annotation
 - summarize_annotation: SO11/SO11_annotation_summary.zip
 - run_dfast: SO11/annotation/dfast/SO11_genome.fna, SO11/annotation/dfast/SO11_cds.fna, SO11/annotation/dfast/SO11_protein.faa, SO11/annotation/dfast/SO11_genome.embl, SO11/annotation/dfast/SO11_genome.gbk, SO11/annotation/dfast/SO11_genome.gff, SO11/annotation/dfast/SO11_rna.fna, SO11/annotation/dfast/SO11_pseudogene_summary.tsv, SO11/annotation/dfast
 - run_gtdbtk: SO11/annotation/gtdbtk/batchfile.tsv, SO11/annotation/gtdbtk/SO11_gtdbtk.summary.tsv, SO11/annotation/gtdbtk/run_files
 - run_checkm2: SO11/annotation/checkm/SO11_checkm_quality_report.tsv, SO11/annotation/checkm
LeeBergstrand commented 3 months ago

@jmtsuji Reverting to a version of the code before the medaka optimization (https://github.com/rotary-genomics/rotary/pull/147), where we split the medaka polishing into multiple steps, allows for the sample to continue passed polishing. The old version of the code does not require the contig manifest and thus does not break.

LeeBergstrand commented 3 months ago

@jmtsuji Any thoughts on how to fix this?

LeeBergstrand commented 3 months ago

In at least one of the genomes tested, we failed to find a start gene in the circular element, which means rotation failed. Perhaps it is a circular plasmid or an artifact? Several had very poor assemblies (>30 contigs).

jmtsuji commented 3 months ago

@LeeBergstrand Interesting -- medaka should only run in the circularize module if no short reads are provided (as set by rule finalize_circular_contig_rotation). Just to check, are you running standalone long read assembly when you get this error, or are you running hybrid assembly? This will help with troubleshooting. Thanks.

LeeBergstrand commented 3 months ago

@LeeBergstrand Interesting -- medaka should only run in the circularize module if no short reads are provided (as set by rule finalize_circular_contig_rotation). Just to check, are you running standalone long read assembly when you get this error, or are you running hybrid assembly? This will help with troubleshooting. Thanks.

Yes, I was running a standalone long read assembly with no short reads. I had no illumina data for these samples.

LeeBergstrand commented 3 months ago

@jmtsuji So the newest version of running medaka requires a contig manifest file "{sample}/{step}/medaka/{sample}_contigs.txt". I think this file isn't created for when medaka is rerun. The old version of the medaka code didn't require the generation of the contig manifest file.

checkpoint generate_contig_manifest:
    input:
        "{sample}/{step}/medaka_input/{sample}_input.fasta"
    output:
        "{sample}/{step}/medaka/{sample}_contigs.txt"
    log:
        "{sample}/logs/{step}/contig_names_medaka.log"
    benchmark:
        "{sample}/benchmarks/{step}/medaka.txt"
    shell:
        """
        (grep '^>' {input} | cut -f1 -d' ' | tr -d '>' > {output}) 2> {log}
        """
jmtsuji commented 3 months ago

@LeeBergstrand Thanks for the additional background. Good to know you are working with long reads only -- that helps the error make more sense.

Looking into the config manifest file sounds like a good step for debugging. I might not be able to get to this issue for a couple days, but if changing the config manifest code doesn't solve the issue, I'd be happy to look into other possible causes then!

(As a side note, on my end, I don't test long read only mode as much as hybrid mode at this point, so it's easier for issues to slip through, I think.)

LeeBergstrand commented 3 months ago

@jmtsuji I think this is a circular dependancy issue. Basically, how the medaka code is ran right now is that it needs the input fasta file to build a list of the contigs in that file to then send out be polished individually. Normally, this list is generated from files that pre-exist from the polishing stage. However, when its rerun in circularize it checks for the those files when building the DAG but breaks because they don't exist yet. However, the DAG reevaluation is there to build the files FASTA files that it needs to make the manifest.

LeeBergstrand commented 3 months ago

@jmtsuji My jet brains AI mentioned the following which might be relevant:

Snakemake runs rules and checkpoints asynchronously which sometimes can create a race condition where one rule is looking for the output of another rule, but that output isn't ready yet. If that's the case here, explicitly setting the order of rule execution using priority keyword could help. You can add priority: 5 to all the rules that must happen before this checkpoint and then priority: 1 to the checkpoint itself and the rules that depend on it. This will cause Snakemake to run the high-priority rules first, ensuring that the input files are ready before the checkpoint.

LeeBergstrand commented 3 months ago

@jmtsuji May addressed by https://github.com/rotary-genomics/rotary/pull/194

jmtsuji commented 3 months ago

@LeeBergstrand Although I haven't been able to run any tests on my machine (working while on the go right now), after looking through the code, it seems like you've pinned down a good possible cause for this snakemake error (i.e., the missing "{sample}/circularize/medaka/{sample}_contigs.txt" file). My flight is about to leave, so I can't work on this more at the moment, but I think that #194 is a reasonable short-term workaround if it resolves the problem on your machine. The circular contig names output by checkpoint split_circular_and_linear_contigs should match the contigs where medaka needs to be re-run in the circularize module, so it's okay to use these contig names for "{sample}/circularize/medaka/{sample}_contigs.txt" as you've done.

Longer-term, I think we should develop a more elegant solution for this issue, because it looks like we are trying to write "{sample}/{step}/medaka/{sample}_contigs.txt" twice for the circularize module after the changes in #194. It's possible that setting priorities might help resolve things. Otherwise, we might have to think more about how to get the steps to execute in an appropriate order so that we don't get 'circular dependencies' in the DAG. How does this sound? Sorry, got to run, but feel free to merge #194 in for now.

LeeBergstrand commented 3 months ago

@LeeBergstrand Although I haven't been able to run any tests on my machine (working while on the go right now), after looking through the code, it seems like you've pinned down a good possible cause for this snakemake error (i.e., the missing "{sample}/circularize/medaka/{sample}_contigs.txt" file). My flight is about to leave, so I can't work on this more at the moment, but I think that #194 is a reasonable short-term workaround if it resolves the problem on your machine. The circular contig names output by checkpoint split_circular_and_linear_contigs should match the contigs where medaka needs to be re-run in the circularize module, so it's okay to use these contig names for "{sample}/circularize/medaka/{sample}_contigs.txt" as you've done.

Longer-term, I think we should develop a more elegant solution for this issue, because it looks like we are trying to write "{sample}/{step}/medaka/{sample}_contigs.txt" twice for the circularize module after the changes in #194. It's possible that setting priorities might help resolve things. Otherwise, we might have to think more about how to get the steps to execute in an appropriate order so that we don't get 'circular dependencies' in the DAG. How does this sound? Sorry, got to run, but feel free to merge #194 in for now.

Sounds good. I think this is a circular dependancy issue. I ran a good test of https://github.com/rotary-genomics/rotary/pull/194 and it works.

LeeBergstrand commented 3 months ago

Will merge.

LeeBergstrand commented 3 months ago

@jmtsuji Temporary fix with: https://github.com/rotary-genomics/rotary/pull/198