snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster

Dynamic runtime resource fails with SLURM #62

Open · nikostr opened this issue 7 months ago

nikostr commented 7 months ago

I've created a minimal workflow and set runtime: f"{2 + attempt}h" in my workflow profile. It is correctly parsed by Snakemake, in the sense that it prints runtime=180 as part of the resources, but I get the error SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option. I improvised the runtime specification since I couldn't find a documented way of doing it - is there a recommended/working way to specify dynamic runtimes in the profile?
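For comparison, the per-rule way to scale runtime with the retry attempt is a callable in the Snakefile; a minimal sketch (runtime is given in minutes, and the profile-level equivalent is what's in question here):

import snakemake  # Snakefile, evaluated by Snakemake itself

rule all:
    output:
        'results/a'
    resources:
        # 'attempt' is 1 on the first try and increases with each retry
        runtime=lambda wildcards, attempt: 60 * (2 + attempt)
    shell:
        'touch {output}'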

cmeesters commented 7 months ago

Ouch. Thanks for the report!

Can you please attach your minimal example? And perhaps a log created with snakemake --verbose ..., too? That would be extremely helpful.

nikostr commented 7 months ago

Sure! workflow/Snakefile:

rule all:
    output:
        'results/a'
    shell:
        'touch {output}'

workflow/profiles/default/config.yaml:

executor: slurm
jobs: 1
retries: 2
default-resources:
  slurm_account: <account>
  runtime: f"{2 + attempt}h"
  slurm_partition: core

and a slightly redacted version of the verbose log:

Using workflow specific profile workflow/profiles/default for setting default command line arguments.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: d71a0ae6-210a-4886-b197-508e567eb099
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job      count
-----  -------
all          1
total        1

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Inferred runtime value of 180 minutes from 3h
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Fri Apr  5 11:29:29 2024]
rule all:
    output: results/a
    jobid: 0
    reason: Missing output files: results/a
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=$SLURM_ACCOUNT, runtime=180, slurm_partition=core

sbatch call: sbatch --job-name d71a0ae6-210a-4886-b197-508e567eb099 --output $DIR/snakemake-runtime-bug/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A $SLURM_ACCOUNT -p core -t 180 --mem 1000 --cpus-per-task=1 -D $DIR/snakemake-runtime-bug --wrap="$HOME/.conda/envs/snakemake/bin/python3.12 -m snakemake --snakefile $DIR/snakemake-runtime-bug/workflow/Snakefile --target-jobs all: --allowed-rules all --cores all --attempt 1 --force-use-threads  --resources mem_mb=1000 mem_mib=954 disk_mb=1000 disk_mib=954 --wait-for-files $DIR/snakemake-runtime-bug/.snakemake/tmp._isktcvq --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers mtime software-env code params input --conda-frontend mamba --shared-fs-usage input-output storage-local-copies software-deployment source-cache persistence sources --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --latency-wait 5 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path $HOME/.conda/envs/snakemake/bin --default-resources 'mem_mb=min(max(2*input.size_mb, 1000), 8000)' 'disk_mb=max(2*input.size_mb, 1000)' tmpdir=system_tmpdir slurm_account=$SLURM_ACCOUNT 'runtime=f"{2 + attempt}h"' slurm_partition=core --executor slurm-jobstep --jobs 1 --mode remote"
unlocking
removing lock
removing lock
removed all locks
Full Traceback (most recent call last):
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 138, in run_job
    out = subprocess.check_output(
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/.conda/envs/snakemake/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/.conda/envs/snakemake/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sbatch --job-name d71a0ae6-210a-4886-b197-508e567eb099 --output $DIR/snakemake-runtime-bug/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A $SLURM_ACCOUNT -p core -t 180 --mem 1000 --cpus-per-task=1 -D $DIR/snakemake-runtime-bug --wrap="$HOME/.conda/envs/snakemake/bin/python3.12 -m snakemake --snakefile $DIR/snakemake-runtime-bug/workflow/Snakefile --target-jobs all: --allowed-rules all --cores all --attempt 1 --force-use-threads  --resources mem_mb=1000 mem_mib=954 disk_mb=1000 disk_mib=954 --wait-for-files $DIR/snakemake-runtime-bug/.snakemake/tmp._isktcvq --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers mtime software-env code params input --conda-frontend mamba --shared-fs-usage input-output storage-local-copies software-deployment source-cache persistence sources --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --latency-wait 5 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path $HOME/.conda/envs/snakemake/bin --default-resources 'mem_mb=min(max(2*input.size_mb, 1000), 8000)' 'disk_mb=max(2*input.size_mb, 1000)' tmpdir=system_tmpdir slurm_account=$SLURM_ACCOUNT 'runtime=f"{2 + attempt}h"' slurm_partition=core --executor slurm-jobstep --jobs 1 --mode remote"' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/cli.py", line 2052, in args_to_api
    dag_api.execute_workflow(
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 589, in execute_workflow
    workflow.execute(
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1247, in execute
    raise e
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1243, in execute
    success = self.scheduler.schedule()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/scheduler.py", line 306, in schedule
    self.run(runjobs)
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake/scheduler.py", line 394, in run
    executor.run_jobs(jobs)
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake_interface_executor_plugins/executors/base.py", line 72, in run_jobs
    self.run_job(job)
  File "$HOME/.conda/envs/snakemake/lib/python3.12/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 142, in run_job
    raise WorkflowError(
snakemake_interface_common.exceptions.WorkflowError: SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option

WorkflowError:
SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option
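The sbatch call above points at the root cause: the f-string's double quotes sit inside the double-quoted --wrap="..." argument, so the shell closes the --wrap string early and passes the remainder as extra positional arguments, which sbatch rejects. A sketch of the tokenization using Python's shlex, which follows POSIX shell quoting rules (the command string is abbreviated here):

import shlex

# Abbreviated form of the generated sbatch command; note the double
# quotes inside runtime=f"{2 + attempt}h" nested in --wrap="..."
cmd = 'sbatch --wrap="python -m snakemake --default-resources \'runtime=f"{2 + attempt}h"\'"'
for token in shlex.split(cmd):
    print(token)
# sbatch
# --wrap=python -m snakemake --default-resources 'runtime=f{2
# +
# attempt}h'

Everything after the --wrap token is a stray script argument, hence sbatch: error: Script arguments not permitted with --wrap option.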
nikostr commented 7 months ago

I just tried replacing the runtime with str(2 + attempt) + "h" and it seems to work! Would this be the recommended way to do this? Would it make sense to add this to the documentation?

EDIT: I tried this again, and this time it failed. Additional verification is needed.
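A note on the workaround: str(2 + attempt) + "h" still embeds double quotes in the submitted command line, which could explain why it only works intermittently. A quote-free sketch for the profile, assuming default-resources expressions may use attempt and that a bare integer runtime is interpreted as minutes (as the "Inferred runtime value of 180 minutes" log line suggests):

default-resources:
  slurm_account: <account>
  # arithmetic only - no quote characters to collide with sbatch --wrap
  runtime: 60 * (2 + attempt)
  slurm_partition: core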