snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Incorrect DAG evaluation in jobstep with ruleorder and checkpoint #60

Closed cademirch closed 6 months ago

cademirch commented 6 months ago

I have a workflow that uses ruleorder to "decide" between two rules depending on the input. In my use case I want to either download a reference genome, or copy a locally provided one to the results directory. Downstream there is a checkpoint, and finally a rule that takes as input the reference genome and checkpoint output.

There is no issue running this workflow locally. However, when executing on slurm, the rule downstream of the checkpoint fails. Inspecting the slurm log for that rule shows that the DAG is evaluated incorrectly. I've provided a minimal example that reproduces this behavior below.

With this example, the do_stuff rule fails only for the wildcard value genome == 0. In the slurm log for that job, the DAG is rebuilt and Snakemake decides to run the copy_genome rule. This happens even though 1) all inputs for do_stuff were present at the time of job submission, and 2) copy_genome is not the right rule to run given the ruleorder.

I wonder whether this is related to the "reason: Forced execution" line in the slurm job log, though I'm unsure.

Appreciate any help!

Snakefile:

ruleorder: copy_genome > sim_download_genome

from pathlib import Path

g = Path("./scatch/genome1.fa")
if not g.exists():
    g.parent.mkdir(exist_ok=True)
    with open(g, "w") as f:
        print("hi", file=f)

def get_genome(wc):
    if wc.genome == "0":
        return "Need to download"
    elif wc.genome == "1":
        return "./scatch/genome1.fa"

def checkpoint_func(wc):
    checkpoint_output = checkpoints.intermediate_step.get(**wc).output[0]
    return checkpoint_output

rule all:
    input:
        expand("results/{genome}.somestuff", genome=[0, 1]),

rule copy_genome:
    input:
        get_genome,
    output:
        "results/{genome}.fa",
    shell:
        "cp {input} {output}"

rule sim_download_genome:
    output:
        "results/{genome}.fa",
    shell:
        "echo hi > {output}"

rule sim_index_genome:
    input:
        "results/{genome}.fa",
    output:
        "results/{genome}.fa.index",
    shell:
        "cp {input} {output}"

checkpoint intermediate_step:
    input:
        ref="results/{genome}.fa",
    output:
        "results/{genome}.int",
    shell:
        "echo hi > {output}"

rule do_stuff:
    input:
        chk=checkpoint_func,
        ref="results/{genome}.fa",
        ind="results/{genome}.fa.index",
    output:
        "results/{genome}.somestuff",
    shell:
        "cat {input} > {output}"

Snakemake error:

Error in rule do_stuff:
    message: SLURM-job '3248276' failed, SLURM status is: 'FAILED'For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: results/0.int, results/0.fa, results/0.fa.index
    output: results/0.somestuff
    log: /private/groups/russelllab/cade/snakeissue/.snakemake/slurm_logs/rule_do_stuff/3248276_0.log (check log file(s) for error details)
    shell:
        cat results/0.int results/0.fa results/0.fa.index > results/0.somestuff
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 3248276

Slurm log for that job:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Select jobs to execute...
Execute 1 jobs...

[Wed Apr  3 14:18:37 2024]
rule copy_genome:
    input: Need to download
    output: results/0.fa
    jobid: 2
    reason: Set of input files has changed since last execution; Code has changed since last execution
    wildcards: genome=0
    resources: mem_mb=<TBD>, disk_mb=<TBD>, tmpdir=<TBD>

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Select jobs to execute...
Execute 1 jobs...

[Wed Apr  3 14:18:38 2024]
localrule copy_genome:
    input: Need to download
    output: results/0.fa
    jobid: 0
    reason: Forced execution
    wildcards: genome=0
    resources: mem_mb=<TBD>, disk_mb=<TBD>, tmpdir=/data/tmp

Waiting at most 5 seconds for missing files.
WorkflowError in rule copy_genome in file /private/groups/russelllab/cade/snakeissue/Snakefile, line 30:
OSError: Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
Need to download (missing locally, parent dir not present)
srun: error: phoenix-21: task 0: Exited with exit code 1
[Wed Apr  3 14:18:43 2024]
Error in rule copy_genome:
    jobid: 2
    input: Need to download
    output: results/0.fa
    shell:
        cp Need to download results/0.fa
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.

cmeesters commented 6 months ago

Thank you for your report and the minimal example.

However, I have bad or good news, depending on your view: the minimal example started working after a few tweaks. Specifically: the file Need to download is never produced, and the spaces in its name are not escaped. So one alternative is to touch that file and refer to it as Need\ to\ download; the other is simply to use the attached Snakefile.

Snakefile.txt
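
To illustrate the escaping point: the shell command shown in the slurm log, cp Need to download results/0.fa, is split on whitespace, so cp receives three separate source arguments instead of one file name. The sketch below uses Python's standard shlex module purely to demonstrate that splitting; it is not meant to describe how Snakemake builds its commands internally.

```python
import shlex

# The command exactly as it appears in the slurm log.
unquoted = "cp Need to download results/0.fa"
print(shlex.split(unquoted))
# ['cp', 'Need', 'to', 'download', 'results/0.fa']

# Quoting the file name keeps it together as a single argument.
quoted = f"cp {shlex.quote('Need to download')} results/0.fa"
print(quoted)
# cp 'Need to download' results/0.fa
print(shlex.split(quoted))
# ['cp', 'Need to download', 'results/0.fa']
```

Even with correct quoting, the file would still have to exist for copy_genome to succeed, which is why touching it is part of that workaround.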

cademirch commented 6 months ago

Thanks for the quick response!

I'm not sure I follow. In the Snakefile you uploaded, the get_genome function returns the same genome file for both wildcard values, which is not what I'm trying to do.

Can you explain what you mean by touching "need to download" and working with it?

nikostr commented 6 months ago

When you return "Need to download", Snakemake interprets Need to download as an input file to your rule. If you replace that return line with a raise, your get_genome function will error out for that wildcard value and sim_download_genome will be used instead.

EDIT: For an example of this being done in an actual workflow, see e.g. https://github.com/nikostr/read-mapping/blob/7f761dfe85bc1e532bebd9efa2c61ff2246ccbdb/workflow/rules/common.smk#L30C1-L42C10 and https://github.com/nikostr/read-mapping/blob/main/workflow/rules/trimming.smk
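
Applied to the minimal example, the suggestion could look roughly like the sketch below. The thread only says to replace the return with a raise; the choice of ValueError and the LOCAL_GENOMES lookup table are assumptions made for illustration and are not taken from the linked workflow. SimpleNamespace merely stands in for Snakemake's wildcards object so the function can be tried outside a workflow.

```python
from types import SimpleNamespace

# Hypothetical lookup table: wildcard value -> locally provided genome.
LOCAL_GENOMES = {"1": "./scatch/genome1.fa"}


def get_genome(wc):
    """Input function for copy_genome: return a local path or raise."""
    try:
        return LOCAL_GENOMES[wc.genome]
    except KeyError:
        # No placeholder string is returned, so nothing can be mistaken
        # for an input file name. The exception type is an assumption.
        raise ValueError(f"no local genome for genome={wc.genome!r}")


# Stand-in for the wildcards object, for a quick check outside Snakemake.
print(get_genome(SimpleNamespace(genome="1")))
# ./scatch/genome1.fa
```

With genome=0 the function raises instead of returning a file name, which, according to this thread, is what lets sim_download_genome be chosen for that wildcard value.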

cmeesters commented 6 months ago

@nikostr thanks for helping out!

cademirch commented 6 months ago

@nikostr Thanks, I'll try that. It's interesting, because returning "Need to download" works when executed locally, so there is something different about executing on slurm/cluster. Anyway, it may not matter, since it makes way more sense to raise an error than to return a fake file.

cmeesters commented 6 months ago

One thing which comes to mind: Do your admins allow internet download on compute nodes?

cademirch commented 6 months ago

They do, but I'm not sure it matters in this case as even the MRE (which doesn't do any network things) fails.

cademirch commented 6 months ago

Looks like switching the return "Need to download" to a raise in the input function fixes the issue. Thanks @nikostr and @cmeesters!