snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster

`--cache` is ignored when using the executor #46

Open · jlanga opened this issue 4 months ago

jlanga commented 4 months ago

It would be an awesome feature to recycle past computations via --cache.

I have tried using the slurm executor together with --cache, but the latter is completely ignored, even when the results are already in the cache.

cmeesters commented 4 months ago

It would be an awesome feature ...

Oh yes! And a bug otherwise.

However, can you tell us: is the cache directory available at submit time, and is it the same within the job context?

If so, can you please prepare a minimal example and run it with --verbose --cache ...?
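
To check the first point, something along these lines should show whether the cache path survives into the job environment (a rough sketch; srun forwards the submitting shell's environment by default, and your cluster may require additional account/partition flags):

# on the submit host
echo "$SNAKEMAKE_OUTPUT_CACHE"
# inside a job context
srun -t 1 bash -c 'echo "$SNAKEMAKE_OUTPUT_CACHE"; ls -ld "$SNAKEMAKE_OUTPUT_CACHE"'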

I'll be at a conference for the next few days and will try to look into this issue, but it might take a bit longer.

jlanga commented 4 months ago

Sure!

This is the example Snakefile:

rule cache:
    output: "cached.txt"
    cache: True  # mark this rule's output as eligible for between-workflow caching
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"

rule all:
    input: "cached.txt"
    output: "all.txt"
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"

Setup

mkdir --parents snakecache
chmod ugo+rwx snakecache
export SNAKEMAKE_OUTPUT_CACHE="$(pwd)/snakecache"
snakemake --version
8.5.3

The output is not sent to the cache:

snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: 249f4012-3b8d-43fd-b354-bfd1af4719e1
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job      count
-----  -------
all          1
cache        1
total        2

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:25:57 2024]
rule cache:
    output: cached.txt
    jobid: 1
    reason: Missing output files: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 1 has been submitted with SLURM jobid 13901748 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13901748.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09672927856445312 seconds
The output is:
'13901748|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901748'}
active_jobs_seen_by_sacct are: {'13901748'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09421706199645996 seconds
The output is:
'13901748|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar  4 10:26:57 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:26:57 2024]
rule all:
    input: cached.txt
    output: all.txt
    jobid: 0
    reason: Missing output files: all.txt; Input files updated by another job: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'all:' --allowed-rules 'all' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' 'cached.txt' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13901777 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/13901777.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.12538862228393555 seconds
The output is:
'13901748|COMPLETED
13901777|COMPLETED
'

status_of_jobs after sacct is: {'13901748': 'COMPLETED', '13901777': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901777'}
active_jobs_seen_by_sacct are: {'13901777'}
missing_sacct_status are: set()
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar  4 10:27:08 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T102556.662489.snakemake.log
unlocking
removing lock
removing lock
removed all locks

The job finishes, but the cache is empty:

ls -l snakecache
total 0

Running the workflow locally to populate the cache:

rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --verbose --latency-wait 60 all
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job      count
-----  -------
all          1
cache        1
total        2

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:49:12 2024]
localrule cache:
    output: cached.txt
    jobid: 1
    reason: Missing output files: cached.txt
    resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5

Moving output file cached.txt to cache.
Symlinking output file cached.txt from cache.
Completion of job ['cache'] reported to scheduler.
jobs registered as running before removal {cache}
[Mon Mar  4 10:49:12 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:49:12 2024]
localrule all:
    input: cached.txt
    output: all.txt
    jobid: 0
    reason: Missing output files: all.txt; Input files updated by another job: cached.txt
    resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5

Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar  4 10:49:12 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T104912.299719.snakemake.log
unlocking
removing lock
removing lock
removed all locks

The cache now contains the hashed output file:

ll snakecache
-rw-rw-rw-. 1 jlanga jlanga          0 Mar  4 10:49 d5281ea3cbea5533581d362bf3375f0d61f076708f54dee5360dba560de33b1b

cached.txt is a symlink to the cached file.
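
This can be verified, for example, with readlink (the target should be the hashed file shown above):

# prints the symlink target, i.e. the hashed file inside snakecache
readlink cached.txt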

Delete the output files to test caching again:

rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'

Running snakemake again; this time it should link the output from the cache:

snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: b6bf9837-c0fb-4031-9d67-6c1a61327ff4
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job      count
-----  -------
cache        1
total        1

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Mar  4 10:52:41 2024]
rule cache:
    output: cached.txt
    jobid: 0
    reason: Missing output files: cached.txt
    resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5

No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name b6bf9837-c0fb-4031-9d67-6c1a61327ff4 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads  --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.18vdpw72' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers software-env input mtime params code --conda-frontend 'mamba' --shared-fs-usage source-cache software-deployment input-output persistence sources storage-local-copies --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13902541 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13902541.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0949852466583252 seconds
The output is:
'13902541|COMPLETED
'

status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13902541'}
active_jobs_seen_by_sacct are: {'13902541'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0942690372467041 seconds
The output is:
'13902541|COMPLETED
'

status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar  4 10:53:41 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-03-04T105241.639558.snakemake.log
unlocking
removing lock
removing lock
removed all locks

This time cached.txt is a regular file, not a symlink, meaning the job was re-executed instead of being restored from the cache:

ll
-rw-r--r--.  1 jlanga jlanga    0 Mar  4 10:53 cached.txt
drwxrwsrwx.  2 jlanga jlanga 2882 Mar  4 10:49 snakecache
-rw-r--r--.  1 jlanga jlanga  276 Mar  4 10:23 Snakefile
drwxr-sr-x.  2 jlanga jlanga    0 Mar  4 10:18 test
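
A quick check to tell the two situations apart (a minimal sketch):

# exits 0 only if cached.txt is a symlink, i.e. was served from the cache
test -L cached.txt && echo "symlink: restored from cache" || echo "regular file: cache was bypassed"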

Thanks!

amadeovezz commented 2 months ago

I'm also running into this issue and would be interested in a solution 👍