rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

QC read files deleted too early #132

Open jmtsuji opened 5 months ago

jmtsuji commented 5 months ago

Run / system info

I'm running rotary commit 40381cc on Ubuntu 22.04.3, within the snakemake argument --until circularize (to run everything up until the start of the annotation module).

Problem description

After reaching checkpoint split_circular_and_linear_contigs, I encountered the following error:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - search_contig_start: S18/circularize/identify/S18_circular.faa, S18/circularize/identify/S18_circular.ffn, S18/circularize/identify/S18_circular.gff, S18/circularize/identify/S18_hmmsearch_hits.txt, S18/circularize/identify/S18_hmmsearch_hits_no_comments.txt
 - finalize_circular_contig_rotation: S18/circularize/combine/S18_circular.fasta
 - combine_circular_and_linear_contigs: S18/circularize/combine/S18_combined.fasta
 - process_start_genes: S18/circularize/identify/S18_start_genes.list
 - polish_polypolish: S18/circularize/polypolish/S18_R1.clean.sam, S18/circularize/polypolish/S18_R2.clean.sam, S18/circularize/polypolish/S18_polypolish.fasta, S18/circularize/polypolish/polypolish.debug.log, S18/stats/circularize/polypolish_changes.log
 - map_short_reads_for_polishing: S18/circularize/polypolish/S18_R1.sam, S18/circularize/polypolish/S18_R2.sam, S18/circularize/polypolish/input/S18_input.fasta.0123, S18/circularize/polypolish/input/S18_input.fasta.amb, S18/circularize/polypolish/input/S18_input.fasta.ann, S18/circularize/polypolish/input/S18_input.fasta.bwt.2bit.64, S18/circularize/polypolish/input/S18_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: S18/circularize/identify/S18_start_gene.ffn
 - prepare_polypolish_circularize_input: S18/circularize/polypolish/input/S18_input.fasta
 - run_circlator: S18/circularize/circlator/S18_rotated.fasta, S18/circularize/circlator, S18/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: S18/circularize/S18_circularize.fasta
 - get_polished_contigs: S18/circularize/filter/S18_circular.fasta
Complete log: .snakemake/log/2024-02-15T154852.164572.snakemake.log
Executing: snakemake --snakefile /2022_08_01_rotary_testing/rotary/rotary/rules/rotary.smk --configfile /2022_08_01_rotary_testing/output_dir/config.yaml --directory /2022_08_01_rotary_testing/output_dir --conda-prefix /2022_08_01_rotary_testing/rotary_db_dir/rotary_conda_env --conda-frontend mamba --use-conda --reason --rerun-incomplete --printshellcmds --jobs 48 --until circularize
Command 'snakemake --snakefile /2022_08_01_rotary_testing/rotary/rotary/rules/rotary.smk --configfile /2022_08_01_rotary_testing/output_dir/config.yaml --directory /2022_08_01_rotary_testing/output_dir --conda-prefix /2022_08_01_rotary_testing/rotary_db_dir/rotary_conda_env --conda-frontend mamba --use-conda --reason --rerun-incomplete --printshellcmds --jobs 48 --until circularize' returned non-zero exit status 1.

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

Based on a quick skim of the GitHub issue flagged in the Snakemake error message (https://github.com/snakemake/snakemake/issues/823), I suspect that the issue I am running into has to do with a bug in checkpointing. After hitting the split_circular_and_linear_contigs checkpoint, if my understanding is correct, Snakemake re-evaluates the rest of the DAG for the run. Until that point, snakemake doesn't know if the QC'ed reads will be needed again... it depends on if there are any circular contigs that get passed into the circularization module. It seems that Snakemake pre-emptively deleted the temp() QC read files before knowing if it was really safe to do so. As a result, I think the pipeline crashed once it realized that the QC read files were actually necessary files but were missing for continuing the run.

Possible solution

Set the final QC read files as non-temp until we can find a better solution. An upgrade to Snakemake 8 might help fix this problem, or we might need to eventually contact the Snakemake authors about it.

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

@LeeBergstrand Any thoughts?

LeeBergstrand commented 5 months ago

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

This is expected behaviour. The rule 'circularize' is downstream of all the circularization methods. Because it no longer needs temp reads, it deletes them because all rules that use these reads have been running.

LeeBergstrand commented 5 months ago

rotary run -s'--dag --until circularize' | grep -v 'Executing' | dot -Tpdf > until_circ_dag.pdf

until_circ_dag

LeeBergstrand commented 5 months ago

My understanding is that Snakemake rebuilds the DAG every time it is run. So when you go --until circularize, it builds the DAG until the rule species by the --until without regard to the next steps. As you can see from the DAG graph above, it isn't aware of any of the steps, so it deletes the temp files it doesn't need.

LeeBergstrand commented 5 months ago

When I tried to re-start rotary, it wanted to re-built the QC read files from scratch -- these files had been deleted as temp() -- and essentially start the pipeline over. The deletion of the final QC read files is odd, because these QC read files were still needed for the downstream parts of the pipeline (e.g., for another round of polishing during the circularization module). I added fake final QC read files (and a couple intermediate files) back via touch (with appropriate timestamps), and the pipeline continued past the checkpoint error, but (of course) rotary ran into an error when it tried to use the empty QC read files.

This makes sense to me --> it can't find the QCed reads --> says I need to make the QCed reads --> the new QCed reads affect the following steps (in case the read polishing is non-deterministic) --> re-does everything post assembly.

jmtsuji commented 5 months ago

@LeeBergstrand Sorry, I think I didn't explain the problem clearly. You are right that it would be normal for snakemake to delete the short QC read files after reaching the end of the --until step specified. However, in my case, the pipeline is unable to finish the specified circularize rule. It gets stuck at the checkpoint a few steps before that and errors out early. I think the error is because temp() is deleting the QC read files too early.

So: I start the pipeline and tell it --until circularize The pipeline runs until the end of polish module normally (before the circularize module) --> The pipeline doesn't know if short QC reads will be needed or not at the end of the circularize module due to the checkpoint in the middle of the module --> The pipeline pre-emptively deletes the short QC reads (this is the problem, I think) --> The pipeline evaluates the checkpoint partway through the circularize module and re-builds the DAG --> The pipeline sees that short QC reads are actually needed to finish the circularize module --> The pipeline produces the error shown above and refuses to finish the circularize module, because there are no short QC reads.

Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!

jmtsuji commented 5 months ago

Specifically, in the DAG figure you inserted (thanks for adding it!), here is the circularization module:

スクリーンショット 2024-02-16 12 46 08

split_circular_and_linear_contigs is a checkpoint. At that point, the DAG gets re-evaluated.

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

LeeBergstrand commented 5 months ago

@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize and then restart the pipeline? Isn't --until circularize a corner case? I noticed you used it as an example in the README.

LeeBergstrand commented 5 months ago

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a {step} in some of the polishing rule strings)?

LeeBergstrand commented 5 months ago

The capability to set the QC read files as non-temp is already included in PR https://github.com/rotary-genomics/rotary/pull/130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

jmtsuji commented 5 months ago

@jmtsuji Thanks for the explanation. I'm testing on my machine to confirm the behaviour encountered. To clarify, does this error only occur if you run --until circularize and then restart the pipeline? Isn't --until circularize a corner case? I noticed you used it as an example in the README.

@LeeBergstrand Thanks for testing -- I am also running a few more test on my end with different run params. For now, I assume the error only occurs if I run --until with something after the checkpoint. No re-start of the pipeline is needed -- the error occurs partway through the run.

Using --until circularize gets you the genome without any annotations. This is a nice way to avoid the most resource and time-intensive parts of the pipeline, e.g., for testing.

jmtsuji commented 5 months ago

If you have any circular contigs, then before combine_circular_and_linear_contigs, the DAG will add a bunch of additional rules upon re-evaluation, including rotating to the start gene and performing one more round of Polypolish-based polishing. The polishing step needs the short QC reads. However, this possible use of the QC short reads seems to be hidden from snakemake until the checkpoint is evaluated.

Is there a separate rule for this polishing step? or is it calling polishing rules again somehow (is that why there is a {step} in some of the polishing rule strings)?

It's calling the Polypolish rule again -- this is why there is a {step} wildcard (which can be either polish or circularize).

jmtsuji commented 5 months ago

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

OK, sounds good. I am running a couple tests on my end to confirm that this fix works.

LeeBergstrand commented 5 months ago

The capability to set the QC read files as non-temp is already included in PR #130 . I would suggest that we change the default setting in that PR so that the final QC read files are non-temp by default.

Let's do this until we move to Snakemake 8.

OK, sounds good. I am running a couple tests on my end to confirm that this fix works.

Sounds good. Thanks for giving me an overview of how checkpointing works and how it applies to our software. I send you the test results when I get them.

LeeBergstrand commented 5 months ago

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log
LeeBergstrand commented 5 months ago

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log

@jmtsuji I get the above error every time I run with --until circularize.

LeeBergstrand commented 5 months ago

@jmtsuji I had to run --until circularize twice. It was because I had to stop it halfway through because I forgot to update the rotary using git. Interestingly, I ran into a very similar error running rotary with --until circularize again:

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - process_start_genes: cram/circularize/identify/cram_start_genes.list
 - prepare_polypolish_circularize_input: cram/circularize/polypolish/input/cram_input.fasta
 - get_polished_contigs: blam/circularize/filter/blam_circular.fasta
 - map_short_reads_for_polishing: cram/circularize/polypolish/cram_R1.sam, cram/circularize/polypolish/cram_R2.sam, cram/circularize/polypolish/input/cram_input.fasta.0123, cram/circularize/polypolish/input/cram_input.fasta.amb, cram/circularize/polypolish/input/cram_input.fasta.ann, cram/circularize/polypolish/input/cram_input.fasta.bwt.2bit.64, cram/circularize/polypolish/input/cram_input.fasta.pac
 - circularize: checkpoints/circularize
 - get_start_genes: blam/circularize/identify/blam_start_gene.ffn
 - run_circlator: cram/circularize/circlator/cram_rotated.fasta, cram/circularize/circlator, cram/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - polish_polypolish: cram/circularize/polypolish/cram_R1.clean.sam, cram/circularize/polypolish/cram_R2.clean.sam, cram/circularize/polypolish/cram_polypolish.fasta, cram/circularize/polypolish/polypolish.debug.log, cram/stats/circularize/polypolish_changes.log
 - search_contig_start: blam/circularize/identify/blam_circular.faa, blam/circularize/identify/blam_circular.ffn, blam/circularize/identify/blam_circular.gff, blam/circularize/identify/blam_hmmsearch_hits.txt, blam/circularize/identify/blam_hmmsearch_hits_no_comments.txt
 - get_polished_contigs: cram/circularize/filter/cram_circular.fasta
 - combine_circular_and_linear_contigs: blam/circularize/combine/blam_combined.fasta
 - symlink_circularization: blam/circularize/blam_circularize.fasta
 - process_start_genes: blam/circularize/identify/blam_start_genes.list
 - get_start_genes: cram/circularize/identify/cram_start_gene.ffn
 - finalize_circular_contig_rotation: blam/circularize/combine/blam_circular.fasta
 - run_circlator: blam/circularize/circlator/blam_rotated.fasta, blam/circularize/circlator, blam/circularize/circlator/rotated.promer.contigs_with_ends.fa
 - symlink_circularization: cram/circularize/cram_circularize.fasta
 - map_short_reads_for_polishing: blam/circularize/polypolish/blam_R1.sam, blam/circularize/polypolish/blam_R2.sam, blam/circularize/polypolish/input/blam_input.fasta.0123, blam/circularize/polypolish/input/blam_input.fasta.amb, blam/circularize/polypolish/input/blam_input.fasta.ann, blam/circularize/polypolish/input/blam_input.fasta.bwt.2bit.64, blam/circularize/polypolish/input/blam_input.fasta.pac
 - search_contig_start: cram/circularize/identify/cram_circular.faa, cram/circularize/identify/cram_circular.ffn, cram/circularize/identify/cram_circular.gff, cram/circularize/identify/cram_hmmsearch_hits.txt, cram/circularize/identify/cram_hmmsearch_hits_no_comments.txt
 - polish_polypolish: blam/circularize/polypolish/blam_R1.clean.sam, blam/circularize/polypolish/blam_R2.clean.sam, blam/circularize/polypolish/blam_polypolish.fasta, blam/circularize/polypolish/polypolish.debug.log, blam/stats/circularize/polypolish_changes.log
 - combine_circular_and_linear_contigs: cram/circularize/combine/cram_combined.fasta
 - prepare_polypolish_circularize_input: blam/circularize/polypolish/input/blam_input.fasta
 - finalize_circular_contig_rotation: cram/circularize/combine/cram_circular.fasta
Complete log: .snakemake/log/2024-02-16T005846.034888.snakemake.log

@jmtsuji I get the above error every time I run with --until circularize.

Does that make more sense? I think it is fundamentally a snakemake bug, but I think we could work around it temporarily by making the QC reads non-temp by default. However, I suspect that the bug will not appear if you have the pipeline run fully to the end. Let me know your thoughts here -- thanks!

@jmtsuji Changing it to keep the qc reads fixes the above error. For me, the pipeline, when run end to end, still passes even with the keep qced reads set to false in the config.

jmtsuji commented 5 months ago

@LeeBergstrand My tests are also finished :-) Here are my results, using commit 864ca98 (pypolca branch):

Main rotary run with test dataset

If keep_final_qc_read_files is False (current default):

If keep_final_qc_read_files as True (proposed temporary fix):

These results seem to match your tests results -- good.

Pipeline re-run

Interestingly, I also checked what would happen if I tried to re-run the pipeline after it finished for all 4 tests:

If keep_final_qc_read_files is False (current default):

If keep_final_qc_read_files is True (proposed temporary fix):

Summary

Setting keep_final_qc_read_files fixes two issues:

However, the full pipeline still finishes, like you mentioned, even if keep_final_qc_read_files is False

Moving forward

I could go either way with changing keep_final_qc_read_files to True as default. Because the full pipeline still works with keep_final_qc_read_files as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README that keep_final_qc_read_files must be True to run --until circularize (i.e., to skip genome annotation). On the other hand, we could change our default to keep_final_qc_read_files as True to avoid people running into errors.

The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.

@LeeBergstrand What are your thoughts about how we ought to set keep_final_qc_read_files by default?

LeeBergstrand commented 5 months ago

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: tries to re-analyze the data, starting from long/short read QC
  • --until circularize: tries to re-analyze the data, starting from long/short read QC

If keep_final_qc_read_files is True (proposed temporary fix):

  • Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
  • --until circularize: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz (an early temp file from long read QC)

I got the same effect.

LeeBergstrand commented 5 months ago

Moving forward

I could go either way with changing keep_final_qc_read_files to True as default. Because the full pipeline still works with keep_final_qc_read_files as False, and discarding the QC read files saves a lot of space, I could see how keeping this feature as default could be nice. We could add a note to the README that keep_final_qc_read_files must be True to run --until circularize (i.e., to skip genome annotation). On the other hand, we could change our default to keep_final_qc_read_files as True to avoid people running into errors.

The re-run tests I did suggest there is a secondary bug with our temp file setup that is preventing the pipeline from properly acknowledging long read QC files. I'll open a second issue for this.

@LeeBergstrand What are your thoughts about how we ought to set keep_final_qc_read_files by default?

@jmtsuji I recommend we do the following:

LeeBergstrand commented 5 months ago

@jmtsuji I noticed that you use a Symlinking paradigm throughout the pipeline. For example:

# Conditional based on whether short read polishing was performed
rule pre_coverage_filter:
    input:
        "{sample}/polish/medaka/{sample}_consensus.fasta" if POLISH_WITH_SHORT_READS == False else "{sample}/polish/polca/{sample}_polca.fasta"
    output:
        temp("{sample}/polish/cov_filter/{sample}_pre_filtered.fasta")
    run:
        source_relpath = os.path.relpath(str(input),os.path.dirname(str(output)))
        os.symlink(source_relpath,str(output))

This was quite troublesome when I made the temp() file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.

Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.

LeeBergstrand commented 5 months ago

If keep_final_qc_read_files is False (current default):

  • Whole pipeline: tries to re-analyze the data, starting from long/short read QC
  • --until circularize: tries to re-analyze the data, starting from long/short read QC

If keep_final_qc_read_files is True (proposed temporary fix):

  • Whole pipeline: acknowledges the pipeline is finished and does not re-analyze the data (just re-generates the final checkpoints) <- this is the expected behaviour
  • --until circularize: tries to re-analyze the data, starting from long read QC (short read QC is OK). It claims it is looking for the missing file S18/qc/long/S18_nanopore_qc.fastq.gz (an early temp file from long read QC)

I got the same effect.

@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule: nanopore_qc_filter

Reason: Forced execution

Something is causing the forced execution of the very first rule.

jmtsuji commented 5 months ago

@jmtsuji I recommend we do the following:

  • Create two separate issues for this.
  • Set If keep_final_qc_read_files to True by default.
  • Move to Snakemake 8 and reassess this issue.

@LeeBergstrand Thanks for the thoughts -- I agree. Second issue for checkpointing is already created at #134 , and keep_final_qc_read_files is now set as True by default in PR #138 (thanks for this). Let's keep this issue open and re-assess this bug after moving to Snakemake 8.

jmtsuji commented 5 months ago

This was quite troublesome when I made the temp() file path as Snakemake would start deleting files that the symlinks would point to. I have found that Snakemake doesn't like symlinking because it relies on the presence and absence of files.

Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake is having trouble tracking what files are used by what rules. I'm not sure. I would bring up this symlinking issue later, but I wanted to put it here now to ensure you know about it.

I've also run into symlinking issues while editing the pipeline and agree, it would be better to streamline these parts, if possible. Reducing the number of steps in the analysis this way might help with consistent DAG construction. (Just for reference for the future: for the polish steps, I think it's necessary to either copy or symlink the files before running the polish rules in order for the rules to be compatible with inputs from multiple modules. This is not the case for other rules, I think, so symlinking steps can probably be streamlined in many other cases.) I'll move this to a separate issue.

jmtsuji commented 5 months ago

@jmtsuji It might also rerun the entire pipeline if the config file is modified. My tests aren't really supporting that but I would also look into it. I'm getting the following for rule: nanopore_qc_filter

Reason: Forced execution

Something is causing the forced execution of the very first rule.

Also interesting -- I haven't seen "Forced execution" as a reason to re-run nanopore_qc_filter in my tests, but I was running with a fresh install and config of rotary. I think snakemake also sometimes forces re-runs if it notices a snakefile has been updated compared to the last run.