Closed katievigil closed 11 months ago
There isn't an easy way to guess the runtime, as it will be affected by the repetitiveness and genome size of the organism, not just read counts. Canu is itself a pipeline which submits itself to the grid to run, so the total runtime will also depend on the grid availability/scheduling time. It does have onSuccess/onFailure options which let you specify a script to run after Canu completes. You could then run snakemake up to Canu, stop snakemake, and let Canu resume the snakemake using the onSuccess command, or have that write a checkpoint file that a process in snakemake can busy-wait on (see #471 for some discussion). Alternatively, if you're only running small assemblies, you could restrict Canu to a single node so it won't complete until the whole assembly is finished by adding useGrid=false to your command line.
Thank you! So can I add onSuccess to the "params" of my Canu rule in Snakemake? What should I set onSuccess to?
Here are my Canu snakemake rules:
rule canu_assembly:
    input:
        fastq="/lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/{barcode}.fastq"
    output:
        assembly="assembly/{barcode}/contigs.fasta",
        report="assembly/{barcode}/canu-report.html"
    params:
        genomeSize="2m",
        minInputCoverage=0,
        maxInputCoverage=0,
        corOutCoverage=10000,
        stopOnLowCoverage=0,
        corMinCoverage=0,
        redMemory=32,
        oeaMemory=32,
        batMemory=32,
        correctedErrorRate=0.2,
        corMhapSensitivity="high",
        onSuccess=?????
    threads: 8
    shell:
        "/lustre/project/taw/share/conda-envs/ONRviral/bin/canu -p {wildcards.barcode} -d {output.assembly} onSuccess={params.onSuccess} genomeSize={params.genomeSize} minInputCoverage={params.minInputCoverage} maxInputCoverage={params.maxInputCoverage} corOutCoverage={params.corOutCoverage} stopOnLowCoverage={params.stopOnLowCoverage} corMinCoverage={params.corMinCoverage} redMemory={params.redMemory} batMemory={params.batMemory} oeaMemory={params.oeaMemory} correctedErrorRate={params.correctedErrorRate} corMhapSensitivity={params.corMhapSensitivity} -nanopore {input.fastq}"
https://canu.readthedocs.io/en/latest/parameter-reference.html#global-options (scroll down a bit)
Execute the command supplied when Canu successfully completes an assembly.
The command will execute in the <assembly-directory> (the -d option to canu)
and will be supplied with the name of the assembly (the -p option to canu)
as its first and only parameter.
You could use this to (a) launch snakemake to run the remaining processing steps on the now-complete assembly; (b) create a file that snakemake is actively waiting for (while file-doesn't-exist { sleep 10 }); (c) some other fiendishly clever method of telling snakemake to continue. onFailure is similar, but it runs when the assembly fails.
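As a rough illustration of option (b), the command given to onSuccess could be a small script that drops a marker file. The sketch below relies on the documented behaviour quoted above (the command runs in the assembly directory and receives the assembly name as its only parameter). The script name canu_done.py and the marker file name are hypothetical choices for this example, not anything Canu defines.

```python
#!/usr/bin/env python3
"""Hypothetical onSuccess hook: write a marker file when Canu finishes."""
import sys
from pathlib import Path


def write_marker(prefix: str, directory: str = ".") -> Path:
    """Create '<prefix>.canu.finished' in `directory` and return its path.

    The marker name is an assumption made for this example.
    """
    marker = Path(directory) / f"{prefix}.canu.finished"
    marker.touch()
    return marker


if __name__ == "__main__":
    # Canu is documented to pass the assembly name (-p) as the only argument
    # and to run the command from the assembly directory (-d).
    if len(sys.argv) == 2:
        print(write_marker(sys.argv[1]))
    else:
        print("usage: canu_done.py <assembly-prefix>", file=sys.stderr)
```

Assuming the script is executable, it would be passed to Canu as something like onSuccess=/path/to/canu_done.py, and a matching onFailure script could write a differently named marker so that a failed assembly is also visible to snakemake.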
Note that the danger with a busy-wait is that if canu does not finish successfully, snakemake will wait forever.
Note also that in option (a) it is better to submit a job to the grid to run snakemake than to run snakemake directly.
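One way to limit the "wait forever" risk is to poll for both a success marker and a failure marker, and to give up after a deadline. This is only a sketch: the function name, the marker paths, and the default timings are assumptions for illustration, and the markers are presumed to be written by scripts supplied through onSuccess and onFailure.

```python
import time
from pathlib import Path


def wait_for_canu(success: Path, failure: Path,
                  timeout: float = 3600.0, poll: float = 10.0) -> bool:
    """Poll until one of the marker files appears.

    Returns True if the success marker exists, False if the failure marker
    exists. Raises TimeoutError if neither appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if success.exists():
            return True
        if failure.exists():
            return False
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no Canu marker file after {timeout} s")
        time.sleep(poll)
```

A snakemake rule using a helper like this would then fail cleanly when Canu fails or stalls, instead of blocking indefinitely.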
@brianwalenz thank you for this information. I am still working on it!
I know this is quite a late reply but I have been working on implementing a snakemake pipeline for the past two weeks and have it successfully working.
Waiting at most 3600 seconds for missing files.
MissingOutputException in rule canu_assembly in file /lustre/project/taw/share/conda-envs/snakemake-env/nanopore_viralv3, line 21:
Job 1 completed successfully, but some output files are missing. Missing files after 3600 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
assembly/barcode03/canu-report.html
Removing output files of failed job canu_assembly since they might be corrupted:
assembly/barcode03/contigs.fasta
Skipped removing non-empty directory assembly/barcode03/contigs.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-10-17T080518.997070.snakemake.log
I am working on the same thing, parallel canu via snakemake, and have found that snakemake is particular about the outputs that you give it. To chain rules, you do need to have the output file listed so that snakemake can build the DAG. However, canu takes a directory as output. The way that I have made this work in my snakefiles is to include multiple outputs:
I think that your error is from snakemake feeding the contigs.fasta as the output directory, canu accepting that as a valid output directory, and creating it. Once it's created, snakemake sees that the output has been generated and assumes that the rule has finished. If you're running on an HPC cluster and useGrid=true (or blank), then from snakemake's perspective, that initial rule has finished, the output directory "contigs.fasta/" exists, but the report is missing because canu has not actually finished running.
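To make the directory/file distinction concrete, here is a small sketch of a helper that derives the -d and -p arguments from the contigs file snakemake expects, so that the FASTA path itself is never handed to Canu as a directory. The helper name and its arguments are hypothetical; it assumes the contigs file is named <prefix>.contigs.fasta inside the assembly directory, and it only uses the options that already appear in this thread.

```python
import os


def canu_command(contigs_path, fastq, options, canu="canu"):
    """Build a Canu argument list for the expected contigs file.

    `contigs_path` is assumed to look like '<dir>/<prefix>.contigs.fasta';
    -d receives the directory and -p the prefix, never the FASTA path.
    """
    outdir, filename = os.path.split(contigs_path)
    suffix = ".contigs.fasta"
    if not filename.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix}: {contigs_path}")
    prefix = filename[: -len(suffix)]
    cmd = [canu, "-p", prefix, "-d", outdir]
    # Canu options are passed as key=value pairs on the command line.
    cmd += [f"{key}={value}" for key, value in options.items()]
    cmd += ["-nanopore", fastq]
    return cmd
```

For example, canu_command("assembly/barcode03/barcode03.contigs.fasta", "reads/barcode03.fastq", {"genomeSize": "2m", "useGrid": "false"}) passes assembly/barcode03 to -d and barcode03 to -p (the reads path here is made up for the example).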
To add to what @skoren wrote about setting onSuccess/onFailure, you could also use checkpoints in snakemake to ensure completion before other rules are executed.
I've rewritten your rule below and added what should be written in rule all:
rule all:
    input:
        ........
        "assembly/{barcode}/{barcode}.contigs.fasta",
rule canu_assembly:
    input:
        fastq="/lustre/project/taw/kvigil/ONR/baratariabay/ONR_baratariabay_20_100623/20231006_1648_MN18851_FAW76720_acec0fdf/fastq_pass/concatenate/{barcode}.fastq"
    output:
        # Canu's -d option takes a directory, so it is declared separately from the contigs file.
        outdir=directory("assembly/{barcode}/"),
        assembly="assembly/{barcode}/{barcode}.contigs.fasta",
        checkpoint=touch("checkpoints/{barcode}.canu.finished")
    params:
        genomeSize="2m",
        minInputCoverage=0,
        maxInputCoverage=0,
        corOutCoverage=10000,
        stopOnLowCoverage=0,
        corMinCoverage=0,
        redMemory=32,
        oeaMemory=32,
        batMemory=32,
        correctedErrorRate=0.2,
        corMhapSensitivity="high",
        # Keep Canu on one node so the rule only finishes when the assembly does.
        useGrid="false"
    # The log path must use the same wildcard as the outputs.
    log: "logs/canu/{barcode}.canu.log"
    message:
        "Running Canu for sample {wildcards.barcode}"
    threads: 8
    shell:
        "/lustre/project/taw/share/conda-envs/ONRviral/bin/canu -p {wildcards.barcode} -d {output.outdir} "
        "useGrid={params.useGrid} "
        "genomeSize={params.genomeSize} minInputCoverage={params.minInputCoverage} "
        "maxInputCoverage={params.maxInputCoverage} corOutCoverage={params.corOutCoverage} "
        "stopOnLowCoverage={params.stopOnLowCoverage} corMinCoverage={params.corMinCoverage} "
        "redMemory={params.redMemory} batMemory={params.batMemory} oeaMemory={params.oeaMemory} "
        "correctedErrorRate={params.correctedErrorRate} corMhapSensitivity={params.corMhapSensitivity} -nanopore {input.fastq} 2>{log} "
Hi,
I have successfully used Canu for all my de novo assemblies and I am embarking on creating a snakemake-env for running Canu+medaka+diamond for viral metagenomic sequencing. Has anyone successfully used Canu with snakemake? I am running into an issue where I need to add --latency-wait time so that Canu can create the final assembly.fasta file before moving on to my Medaka rule. I am getting this error (see below). How can I figure out how much time it will take Canu to finish the assembly? When I ran Canu before, I never really knew how long the assembly would take. Is there a way to estimate this from the number of reads I have?
Below is my snakemake-env error: