rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Pipeline tries to re-run from QC when re-started #134

Open jmtsuji opened 5 months ago

jmtsuji commented 5 months ago

System info

Using rotary commit 864ca98 on a Linux server (Ubuntu 22.04.3).

Problem description

I can run rotary run to the end of the pipeline without issues. However, as described in #132, if I try to re-run rotary on the same output directory with the current default settings, rotary tries to re-analyze the data from partway through the QC module, instead of correctly acknowledging that the run is already finished. This is non-ideal because it effectively prevents users from re-running rotary on existing output.

This is also a problem when trying to resume certain incomplete rotary runs (e.g., runs started with --until circularize).

Setting keep_final_qc_read_files to True seems to let rotary properly acknowledge that a run is complete, at least when running to the end of the pipeline.

The main problem rule seems to be nanopore_qc_filter: rotary tries to re-start from there with the reason Missing output files: S18/qc/long/S18_nanopore_qc.fastq.gz (I am using the sample name S18). Some runs also try to re-start rule short_read_reformat, but I think this might be related to #132.

Possible causes

I wonder if we are not leaving enough intermediate analysis files behind for rotary to acknowledge that QC finished properly, particularly for long reads. Only the secondary file S18/qc/long/S18_length_hist.tsv is left behind. The checkpoint files qc_long and qc are both temp files that are deleted at the end of the pipeline.
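For context, here is a minimal sketch of the pattern I suspect (rule and file names are illustrative, not copied from the actual rotary code):

```snakemake
# Hypothetical sketch of the suspected pattern, not the actual rotary rules.
rule qc_long:
    input:
        "{sample}/qc/long/{sample}_nanopore_qc.fastq.gz"
    output:
        temp(touch("checkpoints/qc_long"))  # marker is deleted at end of run

# With both the checkpoint marker and the QC reads flagged as temp(), nothing
# persistent proves that QC ran, so Snakemake reports "Missing output files"
# on re-run and schedules the QC rules again.
```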

Proposed solution

I haven't tested this yet, but I wonder if just setting the touched checkpoint file qc to non-temp would fix our problem. We might need to do this for all of the touched checkpoint files in the pipeline (e.g., assembly, polish, and so on). Also, these checkpoints are currently global across all samples, but we could consider making them sample-specific (e.g., S18/checkpoints/qc rather than checkpoints/qc) to give finer-grained checkpoints.
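A rough sketch of what I mean (hypothetical rule, not tested against the actual rotary code):

```snakemake
# Hypothetical: drop temp() so the checkpoint marker survives the run.
rule qc_checkpoint:
    input:
        "checkpoints/qc_long"
    output:
        touch("checkpoints/qc")

# Sample-specific variant, giving each sample its own marker:
#     output: touch("{sample}/checkpoints/qc")
```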

An alternative would be to make some kind of summary report file for QC that requires the final QC read files as input. I think this might be enough for the DAG to figure out that the QC read files were created at some point in the pipeline. We have summary output/report files like this for the other rotary modules.
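Something along these lines might work (hypothetical rule; seqkit is just one option for generating the stats):

```snakemake
# Hypothetical QC summary rule: because it takes the final QC reads as input
# and its output is persistent, the DAG can infer that the reads were created
# at some point, even after they are deleted as temp files.
rule qc_summary:
    input:
        long="{sample}/qc/long/{sample}_nanopore_qc.fastq.gz"
    output:
        "{sample}/qc/{sample}_qc_summary.tsv"
    shell:
        "seqkit stats -Ta {input.long} > {output}"
```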

Things to consider

Assuming both of the solutions I've proposed above are effective, I think the second one (making a QC report/summary file) is more elegant, because it avoids leaving many stray checkpoint files behind at the end of a run. However, keeping non-temp checkpoint files could make the pipeline more robust to future code changes.