Open jmtsuji opened 9 months ago
When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.
This is expected behaviour. The rule 'circularize' is downstream of all the circularization methods. Because it no longer needs temp reads, it deletes them because all rules that use these reads have been running.
rotary run -s'--dag --until circularize' | grep -v 'Executing' | dot -Tpdf > until_circ_dag.pdf
My understanding is that Snakemake rebuilds the DAG every time it is run. So when you go --until circularize,
it builds the DAG until the rule species by the --until
without regard to the next steps. As you can see from the DAG graph above, it isn't aware of any of the steps, so it deletes the temp files it doesn't need.
When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.
This makes sense to me --> it can't find the QCed reads --> says I need to make the QCed reads --> the new QCed reads affect the following steps (in case the read polishing is non-deterministic) --> re-does everything post assembly.
@LeeBergstrand Sorry, I think I didn't explain the problem clearly. You are right that it would be normal for snakemake to delete the short QC read files after reaching the end of the --until
step specified. However, in my case, the pipeline is unable to finish the specified circularize
rule. It gets stuck at the checkpoint a few steps before that and errors out early. I think the error is because temp()
is deleting the QC read files too early.
So:
I start the pipeline and tell it --until circularize
The pipeline runs until the end of polish
module normally (before the circularize
module) -->
The pipeline doesn't know if short QC reads will be needed or not at the end of the circularize
module due to the checkpoint in the middle of the module -->
The pipeline pre-emptively deletes the short QC reads (this is the problem, I think) -->
The pipeline evaluates the checkpoint partway through the circularize
module and re-builds the DAG -->
The pipeline sees that short QC reads are actually needed to finish the circularize
module -->
The pipeline produces the error shown above and refuses to finish the circularize
module, because there are no short QC reads.
Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!
Specifically, in the DAG figure you inserted (thanks for adding it!), here is the circularization module:
split_circular_and_linear_contigs
is a checkpoint. At that point, the DAG gets re-evaluated.
If you have any circular contigs, then before combine_circular_and_linear_contigs
, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.
@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize
and then restart the pipeline? Isn't --until circularize
a corner case? I noticed you used it as an example in the README
.
If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.
Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a {step}
in some of the polishing rule strings)?
The capability to set the QC read files as non-temp is already included in PR https://github.com/rotary-genomics/rotary/pull/130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.
Let's do this until we move to Snakemake 8.
@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize and then restart the pipeline? Isn't --until circularize a corner case? I noticed you used it as an example in the README.
@LeeBergstrand Thanks for testing -- I am also running a few more test on my end with different run params. For now, I assume the error only occurs if I run --until
with something after the checkpoint. No re-start of the pipeline is needed -- the error occurs partway through the run.
Using --until circularize
gets you the genome without any annotations. This is a nice way to avoid the most resource and time-intensive parts of the pipeline, e.g., for testing.
If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.
Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a
{step}
in some of the polishing rule strings)?
It's calling the Polypolish rule again -- this is why there is a {step}
wildcard (which can be either polish
or circularize
).
The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.
Let's do this until we move to Snakemake 8.
OK, sounds good. I am running a couple tests on my end to confirm that this fix works.
The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.
Let's do this until we move to Snakemake 8.
OK, sounds good. I am running a couple tests on my end to confirm that this fix works.
Sounds good. Thanks for giving me an overview of how checkpointing works and how it applies to our software. I send you the test results when I get them.
@jmtsuji I had to run --until circularize
twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize
again:
BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
- process_start_genes: cram/circularize/identify/cram_start_genes.list
- prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
- get_polished_contigs: blam/circularize/filter/blam_circular.fasta
- map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
- circularize: checkpoints/circularize
- get_start_genes: blam/circularize/identify/blam_start_gene.ffn
- run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
- polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
- search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
- get_polished_contigs: cram/circularize/filter/cram_circular.fasta
- combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
- symlink_circularization: blam/circularize/blam_circularize.fasta
- process_start_genes: blam/circularize/identify/blam_start_genes.list
- get_start_genes: cram/circularize/identify/cram_start_gene.ffn
- finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
- run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
- symlink_circularization: cram/circularize/cram_circularize.fasta
- map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
- search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
- polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
- combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
- prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
- finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log
@jmtsuji I had to run
--until circularize
twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with--until circularize
again:BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information. Remaining jobs: - process_start_genes: cram/circularize/identify/cram_start_genes.list - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta - get_polished_contigs: blam/circularize/filter/blam_circular.fasta - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac - circularize: checkpoints/circularize - get_start_genes: blam/circularize/identify/blam_start_gene.ffn - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt - get_polished_contigs: cram/circularize/filter/cram_circular.fasta - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta - symlink_circularization: blam/circularize/blam_circularize.fasta - process_start_genes: blam/circularize/identify/blam_start_genes.list - get_start_genes: cram/circularize/identify/cram_start_gene.ffn - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa - symlink_circularization: cram/circularize/cram_circularize.fasta - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log
@jmtsuji I get the above error every time I run with --until circularize
.
@jmtsuji I had to run
--until circularize
twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with--until circularize
again:BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information. Remaining jobs: - process_start_genes: cram/circularize/identify/cram_start_genes.list - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta - get_polished_contigs: blam/circularize/filter/blam_circular.fasta - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac - circularize: checkpoints/circularize - get_start_genes: blam/circularize/identify/blam_start_gene.ffn - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt - get_polished_contigs: cram/circularize/filter/cram_circular.fasta - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta - symlink_circularization: blam/circularize/blam_circularize.fasta - process_start_genes: blam/circularize/identify/blam_start_genes.list - get_start_genes: cram/circularize/identify/cram_start_gene.ffn - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa - symlink_circularization: cram/circularize/cram_circularize.fasta - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log
@jmtsuji I get the above error every time I run with
--until circularize
.Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!
@jmtsuji Changing it to keep the qc reads fixes the above error. For me, the pipeline, when run end to end, still passes even with the keep qced reads
set to false in the config.
@LeeBergstrand My tests are also finished :-) Here are my results, using commit 864ca98 (pypolca branch):
keep_final_qc_read_files
is False (current default):--until circularize
: FAILS with the bug abovekeep_final_qc_read_files
as True (proposed temporary fix):--until circularize
: finishes successfullyThese results seem to match your tests results -- good.
Interestingly, I also checked what would happen if I tried to re-run the pipeline after it finished for all 4 tests:
keep_final_qc_read_files
is False (current default):--until circularize
: tries to re-analyze the data, starting from long/short read QCkeep_final_qc_read_files
is True (proposed temporary fix):--until circularize
: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz
(an early temp file from long read QC)Setting keep_final_qc_read_files
fixes two issues:
rule circularize
However, the full pipeline still finishes, like you mentioned, even if keep_final_qc_read_files
is False
I could go either way with changing keep_final_qc_read_files
to True as default. Because the full pipeline still works with keep_final_qc_read_files
as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README that keep_final_qc_read_files
must be True to run --until circularize
(i.e., to skip genome annotation). On the other hand, we could change our default to keep_final_qc_read_files
as True to avoid people running into errors.
The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.
@LeeBergstrand What are your thoughts about how we ought to set keep_final_qc_read_files
by default?
If
keep_final_qc_read_files
is False (current default):
- Whole pipeline: tries to re-analyze the data, starting from long/short read QC
--until circularize
: tries to re-analyze the data, starting from long/short read QCIf
keep_final_qc_read_files
is True (proposed temporary fix):
- Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
--until circularize
: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing fileS18/qc/long/S18_nanopore_qc.fastq.gz
(an early temp file from long read QC)
I got the same effect.
Moving forward
I could go either way with changing
keep_final_qc_read_files
to True as default. Because the full pipeline still works withkeep_final_qc_read_files
as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README thatkeep_final_qc_read_files
must be True to run--until circularize
(i.e., to skip genome annotation). On the other hand, we could change our default tokeep_final_qc_read_files
as True to avoid people running into errors.The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.
@LeeBergstrand What are your thoughts about how we ought to set
keep_final_qc_read_files
by default?
@jmtsuji I recommend we do the following:
If keep_final_qc_read_files
to True
by default.@jmtsuji I noticed that you use a Symlinking paradigm throughout the pipeline. For example:
# Conditional based on whether short read polishing was performed
rule pre_coverage_filter:
input:
"{sample}/polish/medaka/{sample}_consensus.fasta" if POLISH_WITH_SHORT_READS == False else "{sample}/polish/polca/{sample}_polca.fasta"
output:
temp("{sample}/polish/cov_filter/{sample}_pre_filtered.fasta")
run:
source_relpath = os.path.relpath(str(input),os.path.dirname(str(output)))
os.symlink(source_relpath,str(output))
This was quite troublesome when I made the temp()
file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.
Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.
If
keep_final_qc_read_files
is False (current default):
- Whole pipeline: tries to re-analyze the data, starting from long/short read QC
--until circularize
: tries to re-analyze the data, starting from long/short read QCIf
keep_final_qc_read_files
is True (proposed temporary fix):
- Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
--until circularize
: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing fileS18/qc/long/S18_nanopore_qc.fastq.gz
(an early temp file from long read QC)I got the same effect.
@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule: nanopore_qc_filter
Reason: Forced execution
Something is causing the forced execution of the very first rule.
@jmtsuji I recommend we do the following:
- Create two separate issues for this.
- Set
If keep_final_qc_read_files
toTrue
by default.- Move to Snakemake 8 and reassess this issue.
@LeeBergstrand Thanks for the thoughts -- I agree. Second issue for checkpointing is already created at #134 , and keep_final_qc_read_files
is now set as True by default in PR #138 (thanks for this). Let's keep this issue open and re-assess this bug after moving to Snakemake 8.
This was quite troublesome when I made the
temp()
file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.
I've also run into symlinking issues while editing the pipeline and agree, it would be better to streamline these parts, if possible. Reducing the number of steps in the analysis this way might help with consistent DAG construction. (Just for reference for the future: for the polish steps, I think it's necessary to either copy or symlink the files before running the polish rules in order for the rules to be compatible with inputs from multiple modules. This is not the case for other rules, I think, so symlinking steps can probably be streamlined in many other cases.) I'll move this to a separate issue.
@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule:
nanopore_qc_filter
Reason: Forced execution
Something is causing the forced execution of the very first rule.
Also interesting -- I haven't seen "Forced execution" as a reason to re-run nanopore_qc_filter
in my tests, but I was running with a fresh install and config of rotary. I think snakemake also sometimes forces re-runs if it notices a snakefile has been updated compared to the last run.
@jmtsuji This may be related to this snakemake issue. https://github.com/snakemake/snakemake/pull/2737. The fix for this issue, in Snakemake 8, also caused other issues for me in https://github.com/snakemake/snakemake/issues/2982.
Run / system info
I'm running rotary commit 40381cc on Ubuntu 22.04.3, within the snakemake argument
--until circularize
(to run everything up until the start of the annotation module).Problem description
After reaching
checkpoint split_circular_and_linear_contigs
, I encountered the following error:When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as
temp()
-- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back viatouch
(with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.Based on a quick skim of the GitHub issue flagged in the Snakemake error message (https://github.com/snakemake/snakemake/issues/823), I suspect that the issue I am running into has to do with a bug in checkpointing. After hitting the
split_circular_and_linear_contigs
checkpoint, if my understanding is correct, Snakemake re-evaluates the rest of the DAG for the run. Until that point, snakemake doesn't know if the QC'ed reads will be needed again... it depends on if there are any circular contigs that get passed into thecircularization
module. It seems that Snakemake pre-emptively deleted thetemp()
QC read files before knowing if it was really safe to do so. As a result, I think the pipeline crashed once it realized that the QC read files were actually necessary files but were missing for continuing the run.Possible solution
Set the final QC read files as non-temp until we can find a better solution. An upgrade to Snakemake 8 might help fix this problem, or we might need to eventually contact the Snakemake authors about it.
The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.
@LeeBergstrand Any thoughts?