snakemake / snakemake-storage-plugin-fs

A Snakemake storage plugin that reads and writes from a locally mounted filesystem using rsync.

'fs' gets a literal place in the path string #19

Open cmeesters opened 1 month ago

cmeesters commented 1 month ago

Hoi,

After a long while, I tested again, and apparently the previous fix has stopped working. That does not make any sense, so I am probably doing something wrong.

With

$ cat ~/.config/snakemake/config.yaml 
executor: slurm
latency-wait: 5
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - sources
  - source-cache
remote-job-local-storage-prefix: /localscratch/$SLURM_JOB_ID
local-storage-prefix: /dev/shm/$USER

and a workflow like:

localrules: produce_input,

rule all:
     input: "results/a.out"

rule produce_input:
     output: temp("foo.txt")
     shell: "touch {output}"

rule stagein_test:
     input: rules.produce_input.output
     output: "results/a.out"
     shell: """cp {input} {output};
       echo $(realpath {input}) >> {output}
     """

I get:

Error in rule stagein_test:
    message: SLURM-job '15703262' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 1
    input: foo.txt (retrieve from storage)
    output: results/a.out (send to storage)
    log: /gpfs/fs1/home/meesters/projects/hpc-jgu-lifescience/snakemake-workflows/test-workflow/.snakemake/slurm_logs/rule_stagein_test/15703262.log (check log file(s) for error details)
    shell:
        cp /dev/shm/meesters/fs/foo.txt /dev/shm/meesters/fs/results/a.out;
       echo $(realpath /dev/shm/meesters/fs/foo.txt) >> /dev/shm/meesters/fs/results/a.out

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 15703262

My assumption is that local-storage-prefix is applied as the local prefix inside the SLURM job, because the CPU executor is unaware that it is running in a job.

I also observe that the fs from default-storage-provider appears as a literal component of the attempted path.

The intended behaviour would be that the input is copied to remote-job-local-storage-prefix and any output is copied back to the actual (relative) path(s).

cmeesters commented 1 month ago

My assumption appears to be wrong: the remote flag is assigned to the actual Snakemake process running on the compute node.

johanneskoester commented 3 weeks ago

Mhm, unsure what is wrong there. Maybe we should have a call to sort that out. Perhaps after our call this Friday.

johanneskoester commented 3 weeks ago

One important thing to see would be the log of the slurm job that fails.

johanneskoester commented 3 weeks ago

The shell command in the error is likely misleading, because it is formatted with the local representations of the input and output files (as if the job were not running in SLURM). We should at least add a disclaimer to the shell command that is printed in the error case.

cmeesters commented 3 weeks ago

ah, yes, of course.

As to the log file:

WorkflowError:
Failed to create local storage prefix /localscratch/fs
PermissionError: [Errno 13] Permission denied: '/localscratch/fs'
  File "/gpfs/fs1/home/meesters/projects/hpc-jgu-lifescience/snakemake-interface-storage-plugins/snakemake_interface_storage_plugins/storage_provider.py", line 67, in __init__

This is the slurm log - of course, the path /localscratch/fs does not exist. It ought to be /localscratch/15703262 with my particular job id at the time.

The Snakemake log is like the one pasted above, just more boilerplate. I noticed:

 Building DAG of jobs...
SLURM run ID: c32a7540-b225-401c-8ce3-7916a4fd0115
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807

What is this insane number for the provided remote nodes? (No, our cluster is slightly smaller ;-) ).
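
For reference, the reported value is exactly 2**63 - 1, the largest signed 64-bit integer, which is also what sys.maxsize returns on 64-bit CPython builds. That suggests a "no limit" sentinel rather than a real node count. Whether Snakemake sets it this way is an assumption and is not confirmed in this thread. A minimal check:

```python
import sys

# Value copied from the "Provided remote nodes" line in the log above.
reported = 9223372036854775807

# Largest value representable as a signed 64-bit integer.
largest_int64 = 2**63 - 1

# True: the log value is the signed 64-bit maximum.
is_int64_max = reported == largest_int64

# True on 64-bit CPython builds; assumption: the log was produced on one.
matches_sys_maxsize = reported == sys.maxsize

print(is_int64_max, matches_sys_maxsize)
```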

Cheers Christian

johanneskoester commented 3 weeks ago

The problem was a premature replacement (by an empty string) of the SLURM job ID environment variable. The fix is here: https://github.com/snakemake/snakemake/pull/2943. Basically, we now use the base64-encoding mechanism of the Snakemake CLI to keep any environment variables from being evaluated by the shell when they are passed to the cluster backend.
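
The mechanism can be illustrated with a small, self-contained sketch. It is not Snakemake's actual code: shell_expand, encode_arg and decode_arg are hypothetical helpers, and appending the provider name "fs" to the prefix is an assumption inferred from the paths in the logs above. The sketch shows why a prefix expanded on the submit host (where SLURM_JOB_ID is unset) collapses to /localscratch/fs, and why a base64 token survives that step unchanged.

```python
import base64
import posixpath
import re


def shell_expand(text, env):
    """Roughly mimic unquoted shell expansion: unset variables become ''."""
    return re.sub(r"\$(\w+)", lambda m: env.get(m.group(1), ""), text)


def encode_arg(value):
    """Hypothetical helper: hide a value from the shell as base64 text."""
    return base64.b64encode(value.encode()).decode()


def decode_arg(token):
    """Hypothetical helper: recover the original value inside the job."""
    return base64.b64decode(token).decode()


prefix = "/localscratch/$SLURM_JOB_ID"
submit_env = {}                          # SLURM_JOB_ID is not set at submission
job_env = {"SLURM_JOB_ID": "15703262"}   # set by SLURM inside the job

# Premature expansion on the submit host drops the job ID.
premature = shell_expand(prefix, submit_env)
# Assumption: the provider name is appended to the prefix.
broken_path = posixpath.join(premature, "fs")

# The base64 alphabet contains no '$', so the shell leaves the token alone.
token = encode_arg(prefix)
passed_through = shell_expand(token, submit_env)

# Decoding and expanding inside the job yields the intended prefix.
resolved = shell_expand(decode_arg(passed_through), job_env)

print(broken_path)
print(resolved)
```

Deferring the expansion until the process runs inside the job is the point of the design: only there does SLURM_JOB_ID have a value.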

cmeesters commented 3 weeks ago

Great! Would you like to wait for another release and gather more fixes or features - or just release?

cmeesters commented 2 weeks ago

I'm afraid the issue was closed prematurely. It persists with Snakemake version 8.15.2.