Closed: paudano closed this issue 2 months ago.
Please try: slurm_extra="--queue='gpu_dev'", as the help message in the error suggests. Note, however, that SLURM does not know a --queue flag.
Of course I tried that.
- test_rule:slurm_extra="--qos='gpu_dev'"
Building DAG of jobs...
Retrieving input from storage.
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job count
--------- -------
test_rule 1
total 1
Select jobs to execute...
InputFunctionException in rule test_rule in file /home/audanp/temp/deleme/Snakefile, line 3:
Error:
WorkflowError:
Failed to evaluate default resources value '--qos='gpu_dev'.
String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
SyntaxError: unterminated string literal (detected at line 1) (<string>, line 1)
Wildcards:
Traceback:
(rule test_rule, line 3, /home/audanp/temp/deleme/Snakefile)
This is an example I have been mucking with to try to get the slurm plugin to run what was working in Snakemake 7 with --slurm:
- guppy_run:slurm_extra=-q gpu_training --gres gpu:v100:4
Yes, --queue should have been --qos, but it wasn't getting that far.
Oh, can you please try: <rule name>: slurm_extra="--qos='gpu_dev'" - note the space after the colon.
The error changes, but still reporting that it's missing the closing quote:
WorkflowError:
SLURM job submission failed. The error message was /bin/sh: -c: line 1: unexpected EOF while looking for matching `''
/bin/sh: -c: line 2: syntax error: unexpected end of file
Resources config.yaml:
---
set-resources:
  - test_rule:mem_mb=1024
  - test_rule:runtime=30
  - test_rule:slurm_partition=gpus
  - test_rule: slurm_extra="--qos='gpu_dev'"
This is not valid YAML either; try something like:
set-resources:
  test_rule:
    mem_mb: 1024
    runtime: 30
    slurm_partition: "gpus"
    slurm_extra: "--qos='gpu_dev' <further combinations"
Additional rules are described at the same indentation level as test_rule.
With:
---
set-resources:
  test_rule:
    mem_mb: 1024
    runtime: 30
    slurm_partition: "gpus"
    slurm_extra: "--qos='gpu_dev'"
I get the original SyntaxError again:
InputFunctionException in rule test_rule in file /home/audanp/temp/deleme/Snakefile, line 3:
Error:
WorkflowError:
Failed to evaluate default resources value '--qos='gpu_dev'.
String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
SyntaxError: unterminated string literal (detected at line 1) (<string>, line 1)
Wildcards:
Traceback:
(rule test_rule, line 3, /home/audanp/temp/deleme/Snakefile)
Well, I created a minimal example to chase this. Seems to be a bug transporting the information to the SLURM job context.
Minimal example:
rule all:
    input: "results/b.out"

rule test1:
    output: "results/a.out"
    shell: "touch {output}"

rule test2:
    input: rules.test1.output
    output: "results/b.out"
    shell: "cp {input} {output}"
Profile:
$ cat profiles/config.yaml
set-resources:
  test1:
    slurm_partition: "smp"
  test2:
    slurm_partition: "smp"
Call: $ snakemake --executor slurm -j 2 --workflow-profile ./profiles/ --default-resources slurm_account=m2_zdvhpc
It tries to hand the jobstep a correct call, yet also passes the following (obtained by printf-debugging):
--set-resources 'test1:slurm_partition=<function eval_resource_expression.<locals>.callable at 0x7f7807d11940>' 'test2:slurm_partition=<function eval_resource_expression.<locals>.callable at 0x7f7807d11a80>'
Which cannot be evaluated.
I did not notice this at first, as default resources get transported fine (as far as I can see) and the workflow I was trying had sufficient defaults. Resources, however, fail to be set correctly in the jobstep context, due to the erroneous submission command line.
This is hence a bug in snakemake-core. We are on it.
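The printf-debugging output above can be reproduced with a minimal sketch. The function below is a stand-in whose name merely mirrors eval_resource_expression from the debug output; it is not the actual snakemake-core code. It shows how string-formatting a callable resource into the jobstep command line emits the function's repr instead of the original expression:

```python
# Minimal sketch (assumed names, not actual snakemake-core code) of the bug:
# a resource stored as a callable gets interpolated into the jobstep command
# line as its repr, which the jobstep cannot evaluate.

def eval_resource_expression(expr):
    # Resource expressions are wrapped for deferred evaluation in the job
    # context; the inner name mirrors the debug output above.
    def callable(wildcards=None):
        return eval(expr)
    return callable

resource = eval_resource_expression("'smp'")

# Naive string formatting produces the broken argument seen above:
arg = f"--set-resources 'test1:slurm_partition={resource}'"
print(arg)
# --set-resources 'test1:slurm_partition=<function
#   eval_resource_expression.<locals>.callable at 0x...>'
```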
Thank you!
Hi Christian and Peter,
Recently I also encountered this issue when attempting to migrate my workflow to Snakemake v8. I do not think that this issue is only in snakemake-core; I think it is a complicated interaction between the core and this plugin. I tried to pass
default-resources:
  slurm_extra: "--gres=localtmp:10G"
to my workflow. Please note that the syntax with = is not necessary for the slurm command line, but is required for snakemake to parse the line correctly; that is a bug/feature for another time.
From what I can see right now there are two potential sources of errors that arise when the resources get parsed:

1. Resource values are passed through Python's eval function, so they must be valid Python code. This is obviously true for most resource arguments (integers and generic strings), but in our case it is an assignment to a variable starting with --, which is invalid Python. To avoid this bug, the whole string therefore needs to be enclosed in quotes (double or single), which leads to the second issue...

2. Snakemake then builds the sbatch command line from the parsed resources. In this first step there is nothing wrong; however, all default resources are also passed to the sbatch command line inside the --wrap option, which encloses its argument in double quotes " (here). Because of this, and because the slurm job will invoke a shell which adds another layer of parsing / escaping to the script, all double quotes get wiped from the string, which re-triggers the parse error from step one, but now inside the job.

I am currently trying two fixes for these problems and can update you or provide a PR if desired and if my workflow executes properly:

1. slurm_extra: str("--gres=localtmp:10G"), which gets me around all the parser incompatibilities; since the original string is passed around in subsequent command lines, I figured it will usually stay the same.

2. call += f' --wrap="{re.sub(r"([\"])", r"\\\1", exec_job)}"'
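The quote-wiping can be demonstrated with a small shell round-trip. This is a sketch under assumptions: the exec_job string below is illustrative, not the executor's real command line, and echo through /bin/sh stands in for the extra shell layer a SLURM job adds.

```python
# Sketch of the --wrap quoting problem: one shell layer consumes the inner
# double quotes, so the quoting needed by the resource parser is lost
# inside the job. exec_job is an illustrative stand-in.
import re
import subprocess

exec_job = "snakemake --set-resources \"test_rule:slurm_extra='--qos=gpu_dev'\""

# Naive wrapping: the shell strips the embedded quotes.
naive = f'echo --wrap="{exec_job}"'
print(subprocess.run(["/bin/sh", "-c", naive],
                     capture_output=True, text=True).stdout)

# The escaping fix quoted above preserves the double quotes through one
# layer of shell parsing.
escaped = re.sub(r'(["])', r'\\\1', exec_job)
fixed = f'echo --wrap="{escaped}"'
print(subprocess.run(["/bin/sh", "-c", fixed],
                     capture_output=True, text=True).stdout)
```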
I think I'm running into the same problem trying to name jobs by their rules and output slurm logs to named files:
default-resources:
  - slurm_account="hpcusers"
  - slurm_partition="defq"
  - mem_mb=8000
  - runtime=15
  - disk_mb=1000
  - slurm_extra="--job-name=%x --output=slurm_%x_%j"
Different iterations on this either fail with the evaluate default resources value error or do replace the %x and %j but %x becomes the slurm assigned job name (some random alphanumeric string). How do I get the snakemake job names to come through here? This was working with snakemake 7.
The only workaround I could find is to move slurm_extra from your profile to the resources of each job in the Snakefile. Some other Slurm-specific switches might also be broken in the profile; if so, move them too.
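For illustration, this workaround could look something like the following in a Snakefile (a sketch with a hypothetical rule and output file; only the resources block is the point):

```python
rule test_rule:
    output: "results/a.out"
    resources:
        slurm_partition="gpus",
        slurm_extra="--qos='gpu_dev'"
    shell: "touch {output}"
```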
Hmm that is not ideal. Do you know if there's an open issue/PR for this on snakemake core?
There is a merge request in this thread adding a test case. I am not familiar with the organisation of this project, but if it follows test-driven development, then the fix may land later in the same merge request.
@Ulthran I think changing the job name generated by this plugin cannot work at all due to https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/5dfd514b3c246acbeab9257ae029c039105672a0/snakemake_executor_plugin_slurm/__init__.py#L76, which specifies the job name explicitly and sets it to a random uuid4 that is generated once on the first initialization of the workflow / executor object. I also think that this is not a particularly good idea, and I don't really know why it is needed; perhaps one of the original authors can shed some light on that.
I would also much prefer to be able to set the slurm_extra params in a workflow-profile instead of the rules directly, especially since this parameter encompasses many needed slurm flags (qos, gres, etc.).
This is fixed in v8.4.3 - as mentioned in the linked PR here above. However, another fix is forthcoming. Stay tuned.
v8.4.4 is out. Alas, I cannot test it right now, for the lack of a cluster (maintenance). You might want to give it a try.
I see the same error as before with Snakemake 8.4.4 and slurm executor plugin 0.3.0.
Versions:
snakemake 8.4.4 hdfd78af_0 bioconda
snakemake-executor-plugin-slurm 0.3.0 pyhdfd78af_0 bioconda
Error:
InputFunctionException in rule test_rule in file /pod/2/beck-lab/audanp/projects/2024_Snake_Slurm_Extra/Snakefile, line 3:
Error:
WorkflowError:
Failed to evaluate resources value '--qos="gpu_dev"'.
String arguments may need additional quoting. E.g.: --default-resources "tmpdir='/home/user/tmp'".
SyntaxError: invalid syntax (<string>, line 1)
I could be missing something, but I'm still not able to access the rule name to pass through to slurm_extra here. I've made a workaround for it that's a little hacky but seems to be working with this update; I added this to the bottom of my Snakefile:
for rule in rules._rules:
    slurm_extra = f"--job-name={rule} --output=slurm_{rule}_%j"
    getattr(rules, rule).rule.resources["slurm_extra"] = slurm_extra
EDIT: Don't set the job-name like this; the executor relies on the name it applies to determine when jobs have finished. Here is an updated version that gives a new output name and a new comment:
for rule_name in rules._rules:
    rule_obj = getattr(rules, rule_name).rule
    wcs = rule_obj._wildcard_names
    if wcs:
        rule_obj.resources["slurm_extra"] = (
            lambda wc, rule_name=rule_name: f"--output=slurm_{rule_name}_{'_'.join(wc)}_%j --comment={rule_name}_{'_'.join(wc)}"
        )
    else:
        rule_obj.resources["slurm_extra"] = f"--output=slurm_{rule_name}_%j --comment={rule_name}"
Once #35 is included in a release, this shouldn't be necessary anymore.
v8.4.4 is out. Alas, I cannot test it right now, for the lack of a cluster (maintenance). You might want to give it a try.
Previously it was failing before submission to Slurm. Now an error appears when the job gets to a node.
Ah, it took me a while to understand this Failed to evaluate resources value '--qos="gpu_dev"', for I did not read carefully and was a bit occupied yesterday (in fact, I still am :-( ). So, the thing is that the parsing now works, but of course "--qos" is no argument to snakemake. --set-resources "slurm_extra='...'" would be a valid argument. In other words, the snakemake core strips away the slurm_extra and we are back to square one.
@Ulthran A not very satisfactory workaround for the hard-coded job name might be to use job comments instead:
https://github.com/Snakemake-Profiles/slurm/issues/117#issuecomment-1938894912
Edit: I hadn't seen your later post with a better hack. Thanks for sharing.
I think I have an idea. PR is in preparation.
@blaiseli I just edited that comment. Note that setting the job-name will break the executor's ability to tell when jobs finish.
@all please update Snakemake and this executor plugin to the newest releases and try again.
I am still getting parsing errors trying to set --qos through slurm_extra. Was this update just related to the job name issues? I also couldn't get it to replace the job name (jobs are submitted as "%x" to Slurm).
As a side note, what's the best way to install the plugin from the git repo (latest version on bioconda is 0.3.0)? I created a minimal setup.py, and that seems to work.
Here's the snakemake packages I have in my test conda environment (installed dependencies, then snakemake-executor-plugin-slurm from the github repos with setup.py):
snakemake                                8.5.3   pypi_0        pypi
snakemake-executor-plugin-slurm-jobstep  0.1.10  pyhdfd78af_0  bioconda
snakemake-interface-common               1.17.1  pypi_0        pypi
snakemake-interface-executor-plugins     8.2.0   pypi_0        pypi
snakemake-interface-report-plugins       1.0.0   pyhdfd78af_0  bioconda
snakemake-interface-storage-plugins      3.1.0   pypi_0        pypi
snakemake-minimal                        8.5.3   pyhdfd78af_0  bioconda
Package snakemake-executor-plugin-slurm doesn't show up here, but snakemake only finds the plugin after I install it from the repo (commit 3b6a3592413722e17c1f310ca7958f206a91e5cd).
Let me know if I'm testing something incorrectly. Thanks!
snakemake-executor-plugin-slurm is available via bioconda. That reminds me: the minimal software stack in my docs-PR is missing ...
Anyway, for me, slurm_extra works like a charm. The main parsing issue was in the Snakemake core; several attempts to fix it led to yet another side effect. Yet, since we had two rather obscure issues, I asked to update the executor, too.
Now, whether you install the software stack with conda/mamba/micromamba or poetry or write your own setup scripts - as long as everything is part of one environment, it should be fine. You can check yourself whether all packages are under one site-packages folder.
My workflow profile for testing:
default-resources:
  slurm_partition: "smp"
  tasks: 1
set-resources:
  test1:
    slurm_partition: "m2_gpu"
    runtime: 5
    slurm_extra: "'--nice=150 --gres=gpu:1'"
Notwithstanding the different partition, account, and the little fact that you are using a --qos flag, that should work as intended.
Please run your workflow with --verbose, paste the command line and any profile file, and attach the log to your reply. I'm sure we can hunt the issue down, eventually.
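Why the double layer of quotes in the profile can be sketched in plain Python: YAML consumes the outer double quotes, and snakemake then evaluates the remaining string as a Python expression (the value below is illustrative):

```python
# After YAML parsing, the value of
#   slurm_extra: "'--nice=150 --gres=gpu:1'"
# is the following Python string (outer double quotes consumed by YAML):
value = "'--nice=150 --gres=gpu:1'"

# snakemake evaluates resource strings as Python expressions, which
# consumes the inner single quotes and yields the bare flag string:
assert eval(value) == "--nice=150 --gres=gpu:1"

# With only one layer of quotes, eval() sees bare flags, which is not a
# valid Python expression:
try:
    eval("--nice=150 --gres=gpu:1")
except SyntaxError:
    print("a singly-quoted value is not valid Python")
```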
Submission still fails.
I am testing the following profile:
executor: slurm
default-resources:
slurm_partition: some-partition
tasks: 1
threads: 10
slurm_extra: "'--gres=gpu:1'"
With the following Snakemake packages:
snakemake@8.5.3
snakemake-executor-plugin-slurm@0.4.1
snakemake-executor-plugin-slurm-jobstep@0.1.10
snakemake-interface-common@1.17.1
snakemake-interface-executor-plugins@8.2.0
snakemake-interface-storage-plugins@3.1.0
Slurm 23.11.4
Submission to Slurm fails after the job reaches a node:
snakemake force_field
Error in rule force_field:
message: SLURM-job '9731003' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
InputFunctionException in rule force_field in file [obfuscated]/Snakefile, line 6:
Error:
WorkflowError:
Failed to evaluate resources value '--gres=gpu:1'.
String arguments may need additional quoting. E.g.: --default-resources "tmpdir='/home/user/tmp'" or --set-resources "somerule:someresource='--nice=100'". This also holds for setting resources inside of a profile, where you might have to enclose them in single and double quotes, i.e. someresource: "'--nice=100'".
SyntaxError: invalid syntax (<string>, line 1)
Running with --verbose
sbatch call: sbatch --job-name e4d0cd06-d884-4f24-bba2-f8b2141d036b --output [obfuscated]/%j.log --export=ALL --comment force_field -A default -p some-partition --mem 1000 --ntasks=1 --cpus-per-task=1 --gres=gpu:1 -D [obfuscated] --wrap="[obfuscated]/python3.11 -m snakemake --snakefile '/[obfuscated]/Snakefile' --target-jobs 'force_field:' --allowed-rules 'force_field' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1000' 'mem_mib=954' 'disk_mb=1000' 'disk_mib=954' 'tasks=1' 'threads=10' --wait-for-files '[obfuscated]/.snakemake/tmp.5pr3uo7v' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --keep-incomplete --verbose --rerun-triggers mtime --conda-frontend 'mamba' --shared-fs-usage persistence input-output source-cache sources storage-local-copies software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 10 --scheduler 'greedy' --scheduler-solver-path '[obfuscated]' --default-resources 'mem_mb=min(max(2*input.size_mb, 1000), 8000)' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' 'slurm_partition=[obfuscated]' 'tasks=1' 'threads=10' "slurm_extra='--gres=gpu:1'" --executor slurm-jobstep --jobs 1 --mode 'remote'"
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.061034202575683594 seconds
The output is:
'9731006|PENDING'
status_of_jobs after sacct is: {'9731006': 'PENDING'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.06168413162231445 seconds
The output is:
'9731006|PENDING'
status_of_jobs after sacct is: {'9731006': 'PENDING'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-28T17:00 --endtime now --name e4d0cd06-d884-4f24-bba2-f8b2141d036b
It took: 0.05933403968811035 seconds
The output is:
'9731006|FAILED'
status_of_jobs after sacct is: {'9731006': 'FAILED'}
active_jobs_ids_with_current_sacct_status are: {'9731006'}
active_jobs_seen_by_sacct are: {'9731006'}
missing_sacct_status are: set()
Full Traceback (most recent call last):
File "[obfuscated]/.spack-env/view/lib/python3.11/site-packages/snakemake/resources.py", line 533, in generic_callable
value = eval(
^^^^^
File "<string>", line 1
--gres=gpu:1
^
SyntaxError: invalid syntax
Identical profile is fine with Snakemake 7.
I was able to get this working with the latest Snakemake and slurm plugin:
slurm_extra: "'--qos=gpu_inference' '--gres=gpu:v100:4'"
Seems to schedule correctly. If I single-quote around the whole string, Slurm fails.
Again, thank you. It looks like I'll be able to push my GPU workflows now.
Hm, weird - I tried with my cluster and it works fine. Probably worth investigating further.
Anyway, I am curious to learn about your workflow and hope you push it to the snakemake workflow catalogue. I just taught a course on a cluster with different logins for GPU-based work and CPU-based work. This is totally unusual and would break any combined workflow - hence an existing curated workflow might convince those colleagues of mine to reconsider their setup.
When parsing a slurm_extra resource, I get SyntaxError: unterminated string literal.
In my profile, I have set-resources with - test_rule:slurm_extra=--queue='gpu_dev'. The error message says:
The error message has imbalanced single-quotes. The first and last are part of the error-handling code, so it looks like it's trying to eval --queue='gpu_dev. I'm not sure where the quotes are being stripped, and trying to "trick" it by adding additional quotes to the end of the slurm_extra parameter doesn't get around it.
Snakemake: 8.1.0
snakemake-executor-plugin-slurm: 0.1.4
python: 3.12.1
Command and output:
Attached Snakemake and profile: unterminated_issue.tar.gz