snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster

`--cache` is ignored when using the executor #46

Open · jlanga opened this issue 4 months ago

jlanga commented 4 months ago

It would be an awesome feature to recycle past computations via --cache.

I have tried using the slurm executor together with --cache, but the latter is completely ignored, even when the results are already in the cache.

cmeesters commented 4 months ago

It would be an awesome feature ...

Oh yes! And a bug otherwise.

However, can you tell us: is the cache directory available at submit time, and is it the same within the job context?

If so, can you please prepare a minimal example and run it with --verbose --cache ...?
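
To check the first point, something along these lines should show whether the cache path survives into the job environment (a rough sketch; srun forwards the submitting shell's environment by default, and your cluster may require additional account/partition flags):

# on the submit host
echo "$SNAKEMAKE_OUTPUT_CACHE"
# inside a job context
srun -t 1 bash -c 'echo "$SNAKEMAKE_OUTPUT_CACHE"; ls -ld "$SNAKEMAKE_OUTPUT_CACHE"'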

I'll be at a conference for the next few days and will try to look into this issue, but it might take a bit longer.

jlanga commented 4 months ago

Sure!

This is the example Snakefile:

rule cache:
    output: "cached.txt"
    cache: True  # mark this rule's output as eligible for between-workflow caching
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"

rule all:
    input: "cached.txt"
    output: "all.txt"
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"

Setup

mkdir --parents snakecache
chmod ugo+rwx snakecache
export SNAKEMAKE_OUTPUT_CACHE="$(pwd)/snakecache"
snakemake --version
8.5.3

The output is not sent to the cache:

snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: 249f4012-3b8d-43fd-b354-bfd1af4719e1
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job      count
-----  -------
all          1
cache        1
total        2

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:25:57 2024]
rule cache:
    output: cached.txt
    jobid: 1
    reason: Missing output files: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 1 has been submitted with SLURM jobid 13901748 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13901748.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09672927856445312 seconds
The output is:
'13901748|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901748'}
active_jobs_seen_by_sacct are: {'13901748'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09421706199645996 seconds
The output is:
'13901748|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar  4 10:26:57 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:26:57 2024]
rule all:
    input: cached.txt
    output: all.txt
    jobid: 0
    reason: Missing output files: all.txt; Input files updated by another job: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'all:' --allowed-rules 'all' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' 'cached.txt' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13901777 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/13901777.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.12538862228393555 seconds
The output is:
'13901748|COMPLETED
13901777|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED', '13901777': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901777'}
active_jobs_seen_by_sacct are: {'13901777'}
missing_sacct_status are: set()
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar  4 10:27:08 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T102556.662489.snakemake.log
unlocking
removing lock
removing lock
removed all locks

The job finishes, but the cache is empty:

ls -l snakecache
total 0

Running the workflow locally to populate the cache:

rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --verbose --latency-wait 60 all
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
all          1
cache        1
total        2

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:49:12 2024]
localrule cache:
    output: cached.txt
    jobid: 1
    reason: Missing output files: cached.txt
    resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5

Moving output file cached.txt to cache.
Symlinking output file cached.txt from cache.
Completion of job ['cache'] reported to scheduler.
jobs registered as running before removal {cache}
[Mon Mar  4 10:49:12 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:49:12 2024]
localrule all:
    input: cached.txt
    output: all.txt
    jobid: 0
    reason: Missing output files: all.txt; Input files updated by another job: cached.txt
    resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5

Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar  4 10:49:12 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T104912.299719.snakemake.log
unlocking
removing lock
removing lock
removed all locks

The cache now contains the hashed output file:

ll snakecache
-rw-rw-rw-. 1 jlanga jlanga          0 Mar  4 10:49 d5281ea3cbea5533581d362bf3375f0d61f076708f54dee5360dba560de33b1b

cached.txt is a symlink to the cached file.
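
This can be verified, for example, with readlink (the target should be the hashed file shown above):

# prints the symlink target, i.e. the hashed file inside snakecache
readlink cached.txt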

Delete the output files to test caching again:

rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'

Running snakemake again; this time it should link the output from the cache:

snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: b6bf9837-c0fb-4031-9d67-6c1a61327ff4
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job      count
-----  -------
cache        1
total        1

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:52:41 2024]
rule cache:
    output: cached.txt
    jobid: 0
    reason: Missing output files: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name b6bf9837-c0fb-4031-9d67-6c1a61327ff4 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.18vdpw72' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env input mtime params code --conda-frontend 'mamba' --shared-fs-usage source-cache software-deployment input-output persistence sources storage-local-copies --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13902541 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13902541.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0949852466583252 seconds
The output is:
'13902541|COMPLETED
'

status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13902541'}
active_jobs_seen_by_sacct are: {'13902541'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0942690372467041 seconds
The output is:
'13902541|COMPLETED
'

status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar  4 10:53:41 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-03-04T105241.639558.snakemake.log
unlocking
removing lock
removing lock
removed all locks

This time cached.txt is a regular file, not a symlink, meaning the job was re-executed instead of being restored from the cache:

ll
-rw-r--r--.  1 jlanga jlanga    0 Mar  4 10:53 cached.txt
drwxrwsrwx.  2 jlanga jlanga 2882 Mar  4 10:49 snakecache
-rw-r--r--.  1 jlanga jlanga  276 Mar  4 10:23 Snakefile
drwxr-sr-x.  2 jlanga jlanga    0 Mar  4 10:18 test
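
A quick check to tell the two situations apart (a minimal sketch):

# exits 0 only if cached.txt is a symlink, i.e. was served from the cache
test -L cached.txt && echo "symlink: restored from cache" || echo "regular file: cache was bypassed"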

Thanks!

amadeovezz commented 2 months ago

I'm also running into this issue and would be interested in a solution 👍