Open cmeesters opened 1 month ago
My assumption appears to be wrong: the remote flag is assigned to the actual Snakemake process running on the compute node.
Mhm, unsure what is wrong there. Maybe we should have a call to sort that out. Perhaps after our call this Friday.
One important thing to see would be the log of the slurm job that fails.
The shell command in the error is likely misleading, because it is formatted with the local representations of the input and output files (as if the job were not running under SLURM). We should at least add a disclaimer to the shell command that is printed in the error case.
ah, yes, of course.
As to the log file:
WorkflowError:
Failed to create local storage prefix /localscratch/fs
PermissionError: [Errno 13] Permission denied: '/localscratch/fs'
File "/gpfs/fs1/home/meesters/projects/hpc-jgu-lifescience/snakemake-interface-storage-plugins/snakemake_interface_storage_plugins/storage_provider.py", line 67, in __init__
This is the SLURM log - of course, the path /localscratch/fs does not exist. It ought to be /localscratch/15703262, with my particular job ID at the time.
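A plausible expectation (a sketch of the intended behaviour, not the actual plugin code) is that a prefix containing the job ID variable would only resolve on the compute node, where SLURM sets `SLURM_JOB_ID` - which is exactly why a premature expansion on the submit host yields a wrong path:

```python
import os

# Hypothetical illustration: SLURM sets SLURM_JOB_ID inside a job, so a
# prefix written with the variable unexpanded resolves correctly only
# there -- not on the submit host, where the variable is absent.
os.environ["SLURM_JOB_ID"] = "15703262"  # set by SLURM inside a job

prefix = os.path.expandvars("/localscratch/$SLURM_JOB_ID")
print(prefix)  # /localscratch/15703262
```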
The Snakemake log looks like the one pasted above, just with more boilerplate. I noticed:
Building DAG of jobs...
SLURM run ID: c32a7540-b225-401c-8ce3-7916a4fd0115
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807
What is this insane number for the provided remote nodes? (No, our cluster is slightly smaller ;-) ).
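That number is not a cluster size at all: it is 2**63 - 1, i.e. Python's `sys.maxsize` on 64-bit platforms, presumably used as an "unlimited" sentinel when no node count is given (the intent is an assumption, but the value itself is easy to verify):

```python
import sys

# 9223372036854775807 is the largest 64-bit signed integer, a common
# sentinel for "no limit" rather than a real node count.
print(sys.maxsize)               # 9223372036854775807 on 64-bit builds
print(sys.maxsize == 2**63 - 1)  # True on 64-bit builds
```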
Cheers Christian
The problem was a premature replacement (by an empty string) of the SLURM job ID environment variable. The fix is here: https://github.com/snakemake/snakemake/pull/2943. Basically, we now just use the base64-encoding mechanism of the Snakemake CLI to prevent any environment variables from being evaluated by the shell when they are passed to the cluster backend.
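The idea behind the fix can be sketched like this (a simplified illustration of the encoding trick, not the actual code from the PR; the flag value shown is hypothetical):

```python
import base64

# An argument containing an envvar reference must survive the submit-side
# shell untouched; base64-encoding it makes it opaque to shell expansion.
arg = "--local-storage-prefix=/localscratch/$SLURM_JOB_ID"  # hypothetical value
encoded = base64.b64encode(arg.encode()).decode()

# The base64 alphabet contains no '$', so no shell along the way can
# substitute (or blank out) SLURM_JOB_ID prematurely.
assert "$" not in encoded

# On the compute node, Snakemake decodes the argument back before use.
decoded = base64.b64decode(encoded).decode()
assert decoded == arg
```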
Great! Would you like to wait for another release and gather more fixes or features - or just release?
I'm afraid the issue was closed prematurely. It persists with Snakemake version 8.15.2.
Hoi,
after a long while, I tested again, and apparently the previous fix stopped working. That does not make any sense, so probably I am doing something wrong.
With
and a workflow like:
I get:
My assumption is that the `local-storage-prefix` is local within the SLURM job, as the CPU executor is oblivious to running inside a job. I also observe that the `fs` from `default-storage-provider` appears as a literal in the attempted path. The intended behaviour would be that the input is copied to `remote-job-local-storage-prefix` and any output is copied back to the actual (relative) path(s).
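The intended behaviour described above can be sketched as a stage-in/stage-out cycle (a hypothetical illustration with made-up helper names, not the storage plugin's actual API):

```python
import os
import shutil

def stage_in(src: str, job_local_prefix: str) -> str:
    """Copy an input into the job-local prefix; return the staged path."""
    os.makedirs(job_local_prefix, exist_ok=True)
    staged = os.path.join(job_local_prefix, os.path.basename(src))
    shutil.copy2(src, staged)
    return staged

def stage_out(staged: str, dest_relpath: str) -> None:
    """Copy a job-local output back to its (relative) workdir path."""
    dest_dir = os.path.dirname(dest_relpath)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    shutil.copy2(staged, dest_relpath)
```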