snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

`sbatch` call does not escape quotes in `exec_job` #29

Open ebete opened 9 months ago

ebete commented 9 months ago

The sbatch command constructed in snakemake_executor_plugin_slurm.Executor().run_job() does not seem to escape quotes in the exec_job variable, causing issues at the following line:

https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/cf0560cfac101af9600d7a1f7e3e21f5485d47c4/snakemake_executor_plugin_slurm/__init__.py#L133

When exec_job contains a double quote ("), it will generate a not-so-helpful error message:

sbatch: error: script arguments not permitted with --wrap option
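
To see why sbatch complains, it helps to mimic how a POSIX shell splits the assembled call. Below is a minimal sketch using shlex with an invented exec_job fragment (the plugin may build and submit the call differently; this is only meant to illustrate the quoting problem):

import shlex

# invented exec_job fragment containing a double-quoted option value
exec_job = "python -m snakemake --config \"importantlist=['item1', 'item2']\" --cores all"
call = f'sbatch --wrap="{exec_job}"'

# the first inner double quote closes the --wrap="..." value early,
# so the rest of the command becomes extra positional arguments
print(shlex.split(call))
# ['sbatch',
#  '--wrap=python -m snakemake --config importantlist=[item1,',
#  'item2] --cores all']

Because sbatch now sees positional arguments after --wrap, it aborts with exactly the error above.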

Extra info

snakemake 8.4.0
snakemake-executor-plugin-slurm 0.2.1
python 3.11.7
slurm 20.11.9
CentOS 7.9.2009
cmeesters commented 9 months ago

Looking into it. Can you provide a minimal example?

gipert commented 9 months ago

Just stumbled upon this too. Here's a MWE:

# Snakefile
rule all:
    input: "a.dat", "b.dat"

rule a:
    output: "{sample}.dat"
    resources: runtime=50
    group: "foo"
    shell: "touch {output} && sleep 200"
# default/config.yaml
# works on the NERSC supercomputer
executor: slurm
default-resources:
  - slurm_account=m2676
  - constraint=cpu
  - slurm_extra="--qos regular"

jobs: unlimited
cores: all
group-components:
  - foo=2

Execute with snakemake --profile default. Output (the missing-partition error is innocuous):

Using profile default for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807
Job stats:
job      count
-----  -------
a            2
all          1
total        3

Select jobs to execute...
Execute 1 jobs...
[Thu Feb  1 06:42:52 2024]

group job foo (jobs in lexicogr. order):

    [Thu Feb  1 06:42:52 2024]
    rule a:
        output: a.dat
        jobid: 1
        reason: Code has changed since last execution
        wildcards: sample=a
        resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=m2676, constraint=cpu, slurm_extra=--qos regular, runtime=50

    [Thu Feb  1 06:42:52 2024]
    rule a:
        output: b.dat
        jobid: 2
        reason: Code has changed since last execution
        wildcards: sample=b
        resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=m2676, constraint=cpu, slurm_extra=--qos regular, runtime=50

No partition was given for rule 'JobGroup(foo,frozenset({a, a}))', and unable to find a default partition. Trying to submit without partition information. You may want to invoke snakemake with --default-resources 'slurm_partition=<your default partition>'.
WorkflowError:
SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option
cmeesters commented 9 months ago

Fixed in release 8.4.3 of the Snakemake core. However, only the odd propagation of function values instead of string flags has been fixed so far; a release is forthcoming. Stay tuned.

ebete commented 9 months ago

Unfortunately I still get the same error using snakemake 8.4.4 + snakemake-executor-plugin-slurm 0.3.0. The problem, for me at least, is that --apptainer-args is passed with mixed quotes:

See the sbatch command (added some line breaks for clarity):

sbatch call: 
sbatch 
--job-name 851bba7b-9529-4d79-90e2-69cb693d8d46 --output $PROJ/.snakemake/slurm_logs/rule_qiime_import/%j.log --export=ALL --comment qiime_import -A slurm -p cpu -t 900 --mem 50000 --cpus-per-task=1 -D $PROJ 
--wrap="
mambaforge/envs/snakemake/bin/python3.11 -m snakemake --snakefile '$PROJ/Snakefile' --target-jobs 'qiime_import:project=result' --allowed-rules 'qiime_import' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=50000' 'mem_mib=47684' 'disk_mb=2676' 'disk_mib=2553' 'runtime_max=86400' 'load=0' --wait-for-files-file '$PROJ/.snakemake/tmp.qhihjnxt/j339.qiime_import.snk.waitforfilesfile.txt' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete 
--verbose 
--rerun-triggers input params software-env mtime code 
--deployment-method conda apptainer 
--conda-frontend 'mamba' 
--conda-prefix '.snakemake-envs' 
--conda-base-path 'mambaforge/envs/snakemake' 
--apptainer-prefix '.snakemake-envs' 

--apptainer-args "--bind '$one_path:$one_path:rw' --bind '$other_path:$other_path:rw'" 

--shared-fs-usage software-deployment input-output persistence sources source-cache storage-local-copies 
--wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' 
--configfiles $PROJ/setup.yml 
--printshellcmds  
--latency-wait 60 
--scheduler 'ilp' 
--scheduler-solver-path 'mambaforge/envs/snakemake/bin' 
--default-resources 'mem_mb=1000' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' 'slurm_account=slurm' 'slurm_partition=cpu' 'runtime=1440' 'runtime_max=86400' 'load=0' 
--executor slurm-jobstep 
--jobs 1 
--mode 'remote'
"

.. SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option

For now I worked around this by removing the single quotes around the --bind '...' paths, but that will only hold up as long as there are no spaces in them. I'm not sure what the best solution would be, as you basically need both the quotes around the --apptainer-args value and the ones around the --bind arguments. Escaping the quotes gave me a different set of problems.
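
Just as a thought: one way to sidestep the nesting might be to let Python quote the whole wrapped command instead of hard-coding --wrap="...". A minimal sketch with shlex.quote, assuming the submission string is ultimately parsed by a POSIX shell (this is not what the plugin currently does):

import shlex

# invented exec_job containing both double and single quotes
exec_job = (
    "python -m snakemake --apptainer-args "
    "\"--bind '/data/in:/data/in:rw'\" --cores all"
)

# shlex.quote wraps the command in single quotes and re-escapes any
# embedded single quotes, so no inner quote can break out of --wrap=...
call = f"sbatch --wrap={shlex.quote(exec_job)}"
print(call)

That way the shell hands the wrapped command to sbatch as a single argument, whatever quotes it contains.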

If nothing else, it may be an idea to just put an error message here like:

if '"' in exec_job:
    raise ValueError('SLURM submission command contains illegal character: "')
cmeesters commented 9 months ago

What a command line!

Currently, I am unable to test (no cluster; maintenance). Can you point to the offending line of your Snakemake command line or your resources YAML, please?

gipert commented 9 months ago

The MWE I posted earlier keeps failing with Snakemake 8.4.3, for the record.

ebete commented 9 months ago

@cmeesters Luckily, a large part of that is generated by Snakemake itself.

The error stems from the --apptainer-args argument with mixed quotations.

Anyway, a (minimal) reproducible example (requires singularity/apptainer + SLURM):

slurm_test.snk

# snakemake

# set up -------------------------------------------------------------------------------------------
from snakemake.utils import min_version

min_version("8.0.0")

localrules: all
container: "docker://condaforge/mambaforge:latest"

SAMPLE_NAMES = [f"file_{x:d}" for x in range(1,5)]
OUTPUT_DIR = "results_test"

# start --------------------------------------------------------------------------------------------
rule all:
    message: "All target files have been generated/updated"
    default_target: True
    threads: 1
    input:
        expand("{project}/{sample}.txt", sample=SAMPLE_NAMES, project=OUTPUT_DIR)

rule gen_file:
    output:
        "{project}/{sample}.txt"
    threads: 1
    resources:
        load = 100
    shell:
        'printf "now: %s\n" "$(date)" > {output:q}'

run.sh

#!/usr/bin/env bash

set -o nounset
set -o errexit
set -o pipefail

extra_params=${@:1}
# change to the directory containing this script
cd -P -- "$(dirname "$0")"

echo "Snakemake $(snakemake --version) | $(python -VV)" >&2
echo "[RUN_TEST] Running pipeline with test dataset ..." >&2
snakemake \
    --snakefile "slurm_test.snk" \
    --executor slurm \
    --local-cores 1 \
    --jobs 4 \
    --rerun-incomplete \
    --software-deployment-method apptainer \
    --apptainer-prefix "$HOME/.snakemake-envs" \
    --apptainer-args "--bind '$HOME/.snakemake-envs:$HOME/.snakemake-envs:rw'" \
    --printshellcmds \
    --restart-times 0 \
    --show-failed-logs \
    --resources 'load=100' \
    --default-resources \
        'load=0' \
        'runtime=600' \
        'mem_mb=1024' \
        'slurm_account=slurm' \
        'slurm_partition=cpu' \
    --cores 1 \
    --forceall \
    $extra_params
bdelepine commented 9 months ago

Hi everyone,

It seems that the same issue occurs when a workflow-profile configuration file uses the config keyword to store configuration variables. Below are an MWE and a workaround for this specific problem.

Snakefile:

rule all:
  output:
    touch("all.done")
  shell:
    "wait 10"

profile/config.yaml that contains some configuration variables:

executor: "slurm"
default-resources:
  slurm_account: "XXXXXXXXXXXX"
  slurm_partition: "XXXX"
  runtime: 1435
  cpus_per_task: 1
  nodes: 1
  tasks: 1
  mem_mb: 2000
jobs: 1
config:
  importantlist:
    - "item1"
    - 'item2'
    - item3

Calling snakemake with snakemake --workflow-profile profile --verbose, we get:

rule all:
    output: all.done
    jobid: 0
    reason: Missing output files: all.done
    resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_account=XXXXXXX, slurm_partition=fast, runtime=1435, cpus_per_task=1, nodes=1, tasks=1

sbatch call: sbatch --job-name fee9209a-17a3-46d0-bab5-0765fb728317 --output /redacted/path/ --export=ALL --comment all -A XXXXX -p XXXXX -t 1435 --mem 2000 --cpus-per-task=1 -D /redacted/path/ --wrap="/redacted/path/bin/python3.12 -m snakemake --snakefile '/redacted/path/Snakefile' --target-jobs 'all:' --allowed-rules 'all' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=2000' 'mem_mib=1908' 'disk_mb=1000' 'disk_mib=954' 'cpus_per_task=1' 'nodes=1' 'tasks=1' --wait-for-files '/redacted/path/.snakemake/tmp.6vkh9cws' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers code software-env params input mtime --conda-frontend 'mamba' --shared-fs-usage software-deployment sources input-output storage-local-copies persistence source-cache --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --config "importantlist=['item1', 'item2', 'item3']" --latency-wait 5 --scheduler 'ilp' --scheduler-solver-path '/redacted/path/envs/snakemake-8.0.1/bin' --default-resources 'mem_mb=2000' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' 'slurm_account=XXXX' 'slurm_partition=XXXX' 'runtime=1435' 'cpus_per_task=1' 'nodes=1' 'tasks=1' --executor slurm-jobstep --jobs 1 --mode 'remote'"
unlocking
removing lock
removing lock
removed all locks

[...]

WorkflowError:
SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option

Note that the config keyword was translated to --config "importantlist=['item1', 'item2', 'item3']"; the unescaped " is likely what causes the WorkflowError, since " is already used to delimit --wrap="[...]".

As a workaround, I moved my config parameters to a config.yaml file:

importantlist:
  - "item1"
  - 'item2'
  - item3

Running snakemake as snakemake --workflow-profile profile --configfile config.yaml --verbose works as expected.

Tested with:

HTH, best,

cmeesters commented 9 months ago

Please update to the current versions of the executor and Snakemake itself and try again. My own test is working now.

Note that you ought to write: slurm_extra="'--flag1=arg1 --flag2=arg2'"

ebete commented 9 months ago

@cmeesters Still the same exception.

WorkflowError:
SLURM job submission failed. The error message was sbatch: error: Script arguments not permitted with --wrap option

snakemake 8.4.11
snakemake-executor-plugin-slurm 0.3.2
python 3.11.8
cmeesters commented 8 months ago

Sorry, I could not get back to this issue any sooner.

@ebete With $ snakemake --workflow-profile ./profiles -j unlimited (rest are my defaults) and

$ cat ./profiles/config.yaml
default-resources:
    slurm_partition: "<partition>"
    slurm_account: "<account>"
    runtime: "10h"
    apptainer_prefix: "$HOME/.snakemake-envs"
    apptainer_args: "--bind '$HOME/.snakemake-envs:$HOME/.snakemake-envs:rw'"

set-resources:
    gen_file:
        load: 100

I could get your test case to work. Before that, I think we went into Dante's fourth circle of quotation-nesting hell.

cmeesters commented 8 months ago

@ebete is it working? Or do these issues persist?

ebete commented 8 months ago

@cmeesters As of now, the issue still persists. I do have a small patch that should solve it though:

diff --git a/snakemake_executor_plugin_slurm/__init__.py b/snakemake_executor_plugin_slurm/__init__.py
index a97f4aa..a214983 100644
--- a/snakemake_executor_plugin_slurm/__init__.py
+++ b/snakemake_executor_plugin_slurm/__init__.py
@@ -6,6 +6,7 @@ __license__ = "MIT"
 import csv
 from io import StringIO
 import os
+import re
 import subprocess
 import time
 from datetime import datetime, timedelta
@@ -141,6 +142,8 @@ class Executor(RemoteExecutor):
         # (see https://github.com/snakemake/snakemake/issues/2014)
         call += f" -D {self.workflow.workdir_init}"
         # and finally the job to execute with all the snakemake parameters
+        # escape any unescaped double quotes to prevent the --wrap="..." from being closed prematurely
+        exec_job = re.sub(r'(?<!\\)"', r"\"", exec_job)
         call += f' --wrap="{exec_job}"'

         self.logger.debug(f"sbatch call: {call}")
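
For illustration, the substitution turns every double quote that is not already escaped into an escaped one; here is a standalone sketch of its effect on an invented fragment (not part of the plugin):

import re

# invented fragment of an exec_job string
exec_job = 'python -m snakemake --config "importantlist=[1, 2]" --cores all'

# replace every double quote that is not preceded by a backslash
escaped = re.sub(r'(?<!\\)"', r"\"", exec_job)
print(escaped)
# python -m snakemake --config \"importantlist=[1, 2]\" --cores all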
cmeesters commented 8 months ago

Looks nice! Out of curiosity: Did it fail the same way as before using the workflow-profile?

Anyway, I do not want to leave you hanging. Yet I just got a new student, have a workshop to prepare for Wednesday, and will take a few days off because our school is closed. In other words: little time for testing. Will do so a.s.a.p.

ebete commented 8 months ago

I reckon that the profile would work. But in my case I'm generating the --bind arguments dynamically in a wrapper script, so a static YAML file would not suffice.

And no worries, enjoy the break :)