snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Unterminated string literal syntax error parsing slurm_extra #18

Closed paudano closed 2 months ago

paudano commented 6 months ago

When parsing a slurm_extra resource, I get SyntaxError: unterminated string literal.

In my profile, I have set-resources with - test_rule:slurm_extra=--queue='gpu_dev'. The error message says:

Failed to evaluate default resources value '--queue='gpu_dev'.

The error message has imbalanced single-quotes. The first and last are part of the error-handling code, so it looks like it's trying to eval --queue='gpu_dev. I'm not sure where the quotes are being stripped, and trying to "trick" it by adding additional quotes to the end of the slurm_extra parameter doesn't get around it.

Snakemake: 8.1.0
snakemake-executor-plugin-slurm: 0.1.4
Python: 3.12.1

Command and output:

$ snakemake --executor slurm --profile profiles/slurm -j 1
Using profile profiles/slurm for setting default command line arguments.
Building DAG of jobs...
Retrieving input from storage.
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job          count
---------  -------
test_rule        1
total            1

Select jobs to execute...
InputFunctionException in rule test_rule in file /temp/Snakefile, line 3:
Error:
  WorkflowError:
    Failed to evaluate default resources value '--queue='gpu_dev'.
        String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
    SyntaxError: unterminated string literal (detected at line 1) (<string>, line 1)
Wildcards:

Traceback:
 (rule test_rule, line 3, /temp/Snakefile)

Attached Snakemake and profile: unterminated_issue.tar.gz

cmeesters commented 6 months ago

Please try slurm_extra="--queue='gpu_dev'", as the help text in the error message suggests. However, note that SLURM does not have a --queue flag.

paudano commented 6 months ago

Of course I tried that.

- test_rule:slurm_extra="--qos='gpu_dev'"

Building DAG of jobs...
Retrieving input from storage.
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job          count
---------  -------
test_rule        1
total            1

Select jobs to execute...
InputFunctionException in rule test_rule in file /home/audanp/temp/deleme/Snakefile, line 3:
Error:
  WorkflowError:
    Failed to evaluate default resources value '--qos='gpu_dev'.
        String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
    SyntaxError: unterminated string literal (detected at line 1) (<string>, line 1)
Wildcards:

Traceback:
 (rule test_rule, line 3, /home/audanp/temp/deleme/Snakefile)

This is an example I have been mucking with to try to get the slurm plugin to run what was working under Snakemake 7 with --slurm: - guppy_run:slurm_extra=-q gpu_training --gres gpu:v100:4

Yes, --queue should have been --qos, but it wasn't getting that far.

cmeesters commented 6 months ago

Oh, can you please try: <rule name>: slurm_extra="--qos='gpu_dev'" - note the space after the colon.

paudano commented 6 months ago

The error changes, but it still reports that the closing quote is missing:

WorkflowError:
SLURM job submission failed. The error message was /bin/sh: -c: line 1: unexpected EOF while looking for matching `''
/bin/sh: -c: line 2: syntax error: unexpected end of file

Resources config.yaml:

---
set-resources:
  - test_rule:mem_mb=1024
  - test_rule:runtime=30
  - test_rule:slurm_partition=gpus
  - test_rule: slurm_extra="--qos='gpu_dev'"

cmeesters commented 6 months ago

This is not valid YAML either; try something like:

set-resources:
    test_rule:
        mem_mb: 1024
        runtime: 30
        slurm_partition: "gpus"
        slurm_extra: "--qos='gpu_dev' <further combinations"

Additional rules are described with the same indentation level as test_rule.
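
For instance, adding a hypothetical second rule (the extra rule name and values here are purely illustrative):

set-resources:
    test_rule:
        slurm_partition: "gpus"
        slurm_extra: "--qos='gpu_dev'"
    other_rule:
        mem_mb: 2048
        slurm_partition: "gpus"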

paudano commented 6 months ago

With:

---
set-resources:
    test_rule:
        mem_mb: 1024
        runtime: 30
        slurm_partition: "gpus"
        slurm_extra: "--qos='gpu_dev'"

I get the original SyntaxError again:

InputFunctionException in rule test_rule in file /home/audanp/temp/deleme/Snakefile, line 3:
Error:
  WorkflowError:
    Failed to evaluate default resources value '--qos='gpu_dev'.
        String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
    SyntaxError: unterminated string literal (detected at line 1) (<string>, line 1)
Wildcards:

Traceback:
 (rule test_rule, line 3, /home/audanp/temp/deleme/Snakefile)

cmeesters commented 5 months ago

Well, I created a minimal example to chase this. It seems to be a bug in transporting the information to the SLURM job context.

Minimal example:

rule all:
    input: "results/b.out"

rule test1:
    output: "results/a.out"
    shell: "touch {output}"

rule test2:
    input: rules.test1.output
    output: "results/b.out"
    shell: "cp {input} {output}"

Profile:

$ cat profiles/config.yaml 
set-resources:
    test1:
        slurm_partition: "smp"
    test2:
        slurm_partition: "smp"

Call: $ snakemake --executor slurm -j 2 --workflow-profile ./profiles/ --default-resources slurm_account=m2_zdvhpc

Snakemake tries to hand the jobstep a correct call, yet it also passes the following (obtained by printf debugging):

--set-resources 'test1:slurm_partition=<function eval_resource_expression.<locals>.callable at 0x7f7807d11940>' 'test2:slurm_partition=<function eval_resource_expression.<locals>.callable at 0x7f7807d11a80>'

Which cannot be evaluated.

I did not notice this at first, as default resources get transported fine (as far as I can see) and the workflow I was trying had sufficient defaults. Resources, however, fail to be set correctly in the jobstep context, due to the erroneous submission command line.

This is hence a bug in snakemake-core. We are on it.

paudano commented 5 months ago

Thank you!

brand-fabian commented 5 months ago

Hi Christian and Peter,

I recently encountered this issue as well when attempting to migrate my workflow to Snakemake v8, and I do not think it lies only in snakemake-core; rather, it seems to be a complicated interaction between the core and this plugin. I tried to pass

default-resources:
    slurm_extra: "--gres=localtmp:10G"

to my workflow. Please note that the syntax with = is not necessary for the SLURM command line, but is required for Snakemake to parse the line correctly; that is a bug/feature for another time.

From what I can see right now, there are two potential sources of error when the resources get parsed:

  1. In snakemake-core's resources.py, every resource statement is passed to Python's eval function, so it must be valid Python code. This is true for most resource arguments (integers and generic strings), but in our case it is an assignment to a variable starting with --, which is invalid Python (see the sketch after this list). To avoid this, the whole string therefore needs to be enclosed in quotes (double or single), which leads to the second issue...
  2. After snakemake-core parses the resources, this plugin receives the specifications and constructs an sbatch command line from them. Nothing is wrong at this first step; however, all default resources are also passed to the sbatch command line inside the --wrap option, which encloses its argument in double quotes " (here). Because of this, and because the SLURM job invokes a shell that adds another layer of parsing/escaping to the script, all double quotes get stripped from the string, which re-triggers the parse error from step one, but now inside the job.
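
A minimal sketch (not the actual snakemake-core code) of the failure mode in point 1: a bare SLURM flag is not a valid Python expression, while the same text wrapped in an extra layer of quotes is a string literal.

value = "--gres=localtmp:10G"
try:
    eval(value)  # parsed as Python code: "--gres=..." is a syntax error
except SyntaxError as err:
    print(f"SyntaxError: {err}")

quoted = "'--gres=localtmp:10G'"  # inner quotes make it a Python string literal
print(eval(quoted))               # prints: --gres=localtmp:10G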

I am currently trying two fixes for these problems and can update you or provide a PR if desired, once my workflow executes properly:

  1. I use the resource declaration slurm_extra: str("--gres=localtmp:10G"), which gets me around all the parser incompatibilities; since the original string is passed around in subsequent command lines, I figured it will usually stay the same.
  2. Amend L133 with the following (its effect is illustrated after this list):
call += f' --wrap="{re.sub(r"([\"])", r"\\\1", exec_job)}"'
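
For illustration, a standalone sketch of what that substitution does: it backslash-escapes every embedded double quote before the jobstep command is wrapped in --wrap="...". The exec_job value here is a made-up stand-in for the real jobstep command.

import re

# Hypothetical jobstep command containing double quotes.
exec_job = 'python -m snakemake --set-resources "test_rule:slurm_extra=\'--qos=gpu_dev\'"'

escaped = re.sub(r'(["])', r"\\\1", exec_job)  # prefix every " with a backslash
call = f' --wrap="{escaped}"'
print(call)
# --wrap="python -m snakemake --set-resources \"test_rule:slurm_extra='--qos=gpu_dev'\""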

Ulthran commented 5 months ago

I think I'm running into the same problem trying to name jobs by their rules and output slurm logs to named files:

default-resources:
  - slurm_account="hpcusers"
  - slurm_partition="defq"
  - mem_mb=8000
  - runtime=15
  - disk_mb=1000
  - slurm_extra="--job-name=%x --output=slurm_%x_%j"

Different iterations on this either fail with the "Failed to evaluate default resources value" error, or do replace %x and %j, but then %x becomes the SLURM-assigned job name (some random alphanumeric string). How do I get the Snakemake job names to come through here? This was working with Snakemake 7.

w8jcik commented 5 months ago

The only workaround I could find is to move slurm_extra from the profile to the resources of each rule in the Snakefile. Some other SLURM-specific switches might also be broken in the profile; if so, move them too.

Ulthran commented 5 months ago

Hmm, that is not ideal. Do you know if there's an open issue/PR for this on snakemake core?

w8jcik commented 5 months ago

There is a merge request linked in this thread that adds a test case. I am not familiar with the organisation of this project, but if it follows test-driven development, the fix may land later in the same merge request.

brand-fabian commented 5 months ago

@Ulthran I think changing the job name generated by this plugin cannot work at all due to https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/5dfd514b3c246acbeab9257ae029c039105672a0/snakemake_executor_plugin_slurm/__init__.py#L76

which specifies the job name explicitly and sets it to a random uuid4 that is generated once, on the first initialization of the workflow/executor object. I also think that this is not a particularly good idea, and I don't really know why it is needed; perhaps one of the original authors can shed some light on that.

I would also much prefer to be able to set the slurm_extra params in a workflow-profile instead of the rules directly, especially since this parameter encompasses many needed slurm flags (qos, gres, etc.).

cmeesters commented 5 months ago

This is fixed in v8.4.3, as mentioned in the PR linked above. However, another fix is forthcoming. Stay tuned.

cmeesters commented 5 months ago

v8.4.4 is out. Alas, I cannot test it right now for lack of a cluster (maintenance). You might want to give it a try.

paudano commented 5 months ago

I see the same error as before with Snakemake 8.4.4 and slurm executor plugin 0.3.0.

Versions:

snakemake                 8.4.4                hdfd78af_0    bioconda
snakemake-executor-plugin-slurm 0.3.0              pyhdfd78af_0    bioconda

Error:

InputFunctionException in rule test_rule in file /pod/2/beck-lab/audanp/projects/2024_Snake_Slurm_Extra/Snakefile, line 3:
Error:
  WorkflowError:
    Failed to evaluate resources value '--qos="gpu_dev"'.
        String arguments may need additional quoting. E.g.: --default-resources "tmpdir='/home/user/tmp'".
    SyntaxError: invalid syntax (<string>, line 1)

Ulthran commented 5 months ago

I could be missing something, but I'm still not able to access the rule name to pass through to slurm_extra here. I've made a workaround that's a little hacky but seems to work with this update; I added this to the bottom of my Snakefile:

# Inject a per-rule slurm_extra into every rule's resources.
for rule in rules._rules:
    slurm_extra = f"--job-name={rule} --output=slurm_{rule}_%j"
    getattr(rules, rule).rule.resources["slurm_extra"] = slurm_extra

EDIT: Don't set the job-name like this; the executor relies on the name it assigns to determine when jobs have finished. Here is an updated version that sets a new output name and comment instead:

for rule_name in rules._rules:
    rule_obj = getattr(rules, rule_name).rule
    wcs = rule_obj._wildcard_names
    if wcs:
        # Defer formatting until wildcards are known; bind rule_name via a
        # default argument so each lambda keeps its own rule's name.
        rule_obj.resources["slurm_extra"] = (
            lambda wc, rule_name=rule_name: f"--output=slurm_{rule_name}_{'_'.join(wc)}_%j --comment={rule_name}_{'_'.join(wc)}"
        )
    else:
        rule_obj.resources["slurm_extra"] = f"--output=slurm_{rule_name}_%j --comment={rule_name}"

Once #35 is included in a release, this shouldn't be necessary anymore.

w8jcik commented 5 months ago

> v8.4.4 is out. Alas, I cannot test it right now for lack of a cluster (maintenance). You might want to give it a try.

Previously it was failing before submission to Slurm. Now an error appears when the job gets to a node.

cmeesters commented 5 months ago

Ah, it took me a while to understand this Failed to evaluate resources value '--qos="gpu_dev"', for I did not read carefully and was a bit occupied yesterday (in fact, I still am :-( ). The thing is that the parsing now works, but of course "--qos" is no argument to Snakemake; --set-resources "slurm_extra='...'" would be a valid one. In other words, the Snakemake core strips away the slurm_extra key and we are back at square one.
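
Following that pattern, a hypothetical invocation with the quoting Snakemake's own error hint describes (rule and QOS names made up) would look like:

$ snakemake --executor slurm -j 1 --set-resources "test_rule:slurm_extra='--qos=gpu_dev'"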

blaiseli commented 4 months ago

@Ulthran A not very satisfactory workaround for the hard-coded job name might be to use job comments instead:

https://github.com/Snakemake-Profiles/slurm/issues/117#issuecomment-1938894912

Edit: I hadn't seen your later post with a better hack. Thanks for sharing.

cmeesters commented 4 months ago

I think I have an idea. PR is in preparation.

Ulthran commented 4 months ago

@blaiseli I just edited that comment. Note that setting the job-name will break the executor's ability to tell when jobs finish.

cmeesters commented 4 months ago

@all please update Snakemake and this executor plugin to the newest releases and try again.

paudano commented 4 months ago

I am still getting parsing errors trying to set --qos through slurm_extra. Was this update just related to the job name issues? I also couldn't get it to replace the job name (jobs are submitted as "%x" to Slurm).

As a side note, what's the best way to install the plugin from the git repo (latest version on bioconda is 0.3.0)? I created a minimal setup.py, and that seems to work.

Here are the Snakemake packages I have in my test conda environment (I installed the dependencies, then snakemake-executor-plugin-slurm from the GitHub repo with setup.py):

snakemake                                 8.5.3    pypi_0          pypi
snakemake-executor-plugin-slurm-jobstep   0.1.10   pyhdfd78af_0    bioconda
snakemake-interface-common                1.17.1   pypi_0          pypi
snakemake-interface-executor-plugins      8.2.0    pypi_0          pypi
snakemake-interface-report-plugins        1.0.0    pyhdfd78af_0    bioconda
snakemake-interface-storage-plugins       3.1.0    pypi_0          pypi
snakemake-minimal                         8.5.3    pyhdfd78af_0    bioconda

The package snakemake-executor-plugin-slurm doesn't show up in this list, but Snakemake only finds the plugin after I install it from the repo (commit 3b6a3592413722e17c1f310ca7958f206a91e5cd).

Let me know if I'm testing something incorrectly. Thanks!

cmeesters commented 4 months ago

snakemake-executor-plugin-slurm is available via bioconda. That reminds me: The minimal software stack in my docs-PR is missing ...

Anyway, for me, slurm_extra works like a charm. The main parsing issue was in the Snakemake core; several attempts to fix it led to yet another side effect. And since we had two rather obscure issues, I asked you to update the executor, too.

Now, whether you install the software stack with conda/mamba/micromamba or poetry, or write your own setup scripts - as long as everything is part of one environment, it should be fine. You can check yourself whether all packages are under one site-packages folder.

My workflow profile for testing:

default-resources:
    slurm_partition: "smp"
    tasks: 1

set-resources:
    test1:
        slurm_partition: "m2_gpu"
        runtime: 5
        slurm_extra: "'--nice=150 --gres=gpu:1'"

Notwithstanding the different partition and account, and the little fact that you are using a --qos flag, that should work as intended.
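
The double quoting in that profile works because YAML strips the outer double quotes, so Snakemake's evaluator then sees a quoted - hence valid - Python string literal. A quick sketch of the two layers (assuming PyYAML, purely for illustration):

import yaml

doc = """
set-resources:
    test1:
        slurm_extra: "'--nice=150 --gres=gpu:1'"
"""
value = yaml.safe_load(doc)["set-resources"]["test1"]["slurm_extra"]
print(value)        # '--nice=150 --gres=gpu:1'  (single quotes still attached)
print(eval(value))  # --nice=150 --gres=gpu:1    (now a plain Python string)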

Please run your workflow with --verbose, paste the command line, any profile file and attach the log to your reply. I'm sure we can hunt the issue down, eventually.

w8jcik commented 4 months ago

Submission still fails.

I am testing the following profile

executor: slurm

default-resources:
  slurm_partition: some-partition
  tasks: 1
  threads: 10
  slurm_extra: "'--gres=gpu:1'"

with the following Snakemake packages

snakemake@8.5.3
snakemake-executor-plugin-slurm@0.4.1
snakemake-executor-plugin-slurm-jobstep@0.1.10
snakemake-interface-common@1.17.1
snakemake-interface-executor-plugins@8.2.0
snakemake-interface-storage-plugins@3.1.0

Slurm 23.11.4

The submitted job fails after it reaches a node

snakemake force_field
Error in rule force_field:
    message: SLURM-job '9731003' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
InputFunctionException in rule force_field in file [obfuscated]/Snakefile, line 6:
Error:
  WorkflowError:
    Failed to evaluate resources value '--gres=gpu:1'.
        String arguments may need additional quoting. E.g.: --default-resources "tmpdir='/home/user/tmp'" or --set-resources "somerule:someresource='--nice=100'". This also holds for setting resources inside of a profile, where you might have to enclose them in single and double quotes, i.e. someresource: "'--nice=100'".
    SyntaxError: invalid syntax (<string>, line 1)

Running with --verbose

sbatch call: sbatch --job-name e4d0cd06-d884-4f24-bba2-f8b2141d036b --output [obfuscated]/%j.log --export=ALL --comment force_field -A default -p some-partition --mem 1000 --ntasks=1 --cpus-per-task=1 --gres=gpu:1 -D [obfuscated] --wrap="[obfuscated]/python3.11 -m snakemake --snakefile '/[obfuscated]/Snakefile' --target-jobs 'force_field:' --allowed-rules 'force_field' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1000' 'mem_mib=954' 'disk_mb=1000' 'disk_mib=954' 'tasks=1' 'threads=10' --wait-for-files '[obfuscated]/.snakemake/tmp.5pr3uo7v' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --keep-incomplete  --verbose  --rerun-triggers mtime --conda-frontend 'mamba' --shared-fs-usage persistence input-output source-cache sources storage-local-copies software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 10 --scheduler 'greedy' --scheduler-solver-path '[obfuscated]' --default-resources 'mem_mb=min(max(2*input.size_mb, 1000), 8000)' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' 'slurm_partition=[obfuscated]' 'tasks=1' 'threads=10' "slurm_extra='--gres=gpu:1'" --executor slurm-jobstep --jobs 1 --mode 'remote'"

The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.061034202575683594 seconds
The output is:
'9731006|PENDING'

status_of_jobs after sacct is: {'9731006': 'PENDING'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.06168413162231445 seconds
The output is:
'9731006|PENDING'

status_of_jobs after sacct is: {'9731006': 'PENDING'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.05933403968811035 seconds
The output is:
'9731006|FAILED'

status_of_jobs after sacct is: {'9731006': 'FAILED'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
Full Traceback (most recent call last):
  File "[obfuscated]/.spack-env/view/lib/python3.11/site-packages/snakemake/resources.py", line 533, in generic_callable
    value = eval(
            ^^^^^
  File "<string>", line 1
    --gres=gpu:1
          ^
SyntaxError: invalid syntax

The identical profile works fine with Snakemake 7.

paudano commented 2 months ago

I was able to get this working with the latest Snakemake and slurm plugin:

slurm_extra: "'--qos=gpu_inference' '--gres=gpu:v100:4'"

It seems to schedule correctly. If I put single quotes around the whole string instead, SLURM fails.

Again, thank you. It looks like I'll be able to push my GPU workflows now.

cmeesters commented 2 months ago

Hm, weird - I tried it on my cluster and it works fine. Probably worth investigating further.

Anyway, I am curious to learn about your workflow and hope you push it to the Snakemake workflow catalogue. I just taught a course on a cluster with different logins for GPU-based and CPU-based work. This is totally unusual and would break any combined workflow - hence an existing curated workflow might convince those colleagues of mine to reconsider their setup.