Closed: jlanga closed this issue 2 months ago
It would be an awesome feature ...
Oh yes! And a bug otherwise.
However, can you tell us: is the cache directory available at submit time, and is it the same within the job context?
If so, can you please prepare a minimal example and run it with --verbose --cache ...?
I'll be at a conference for the next few days - I will try to look into this issue, but it might take a bit longer.
Sure!
This is the example snakefile:
rule cache:
    output: "cached.txt"
    cache: True
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"


rule all:
    input: "cached.txt"
    output: "all.txt"
    resources:
        mem_mb=500,
        runtime=5,
    shell: "touch {output}"
mkdir --parents snakecache
chmod ugo+rwx snakecache
export SNAKEMAKE_OUTPUT_CACHE="$(pwd)/snakecache"
snakemake --version
8.5.3
snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: 249f4012-3b8d-43fd-b354-bfd1af4719e1
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job count
----- -------
all 1
cache 1
total 2
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:25:57 2024]
rule cache:
output: cached.txt
jobid: 1
reason: Missing output files: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 1 has been submitted with SLURM jobid 13901748 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13901748.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09672927856445312 seconds
The output is:
'13901748|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901748'}
active_jobs_seen_by_sacct are: {'13901748'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09421706199645996 seconds
The output is:
'13901748|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar 4 10:26:57 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:26:57 2024]
rule all:
input: cached.txt
output: all.txt
jobid: 0
reason: Missing output files: all.txt; Input files updated by another job: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'all:' --allowed-rules 'all' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' 'cached.txt' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13901777 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/13901777.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.12538862228393555 seconds
The output is:
'13901748|COMPLETED
13901777|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED', '13901777': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901777'}
active_jobs_seen_by_sacct are: {'13901777'}
missing_sacct_status are: set()
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar 4 10:27:08 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T102556.662489.snakemake.log
unlocking
removing lock
removing lock
removed all locks
The job finishes, but the cache is empty:
ls -l snakecache
total 0
rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --verbose --latency-wait 60 all
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
----- -------
all 1
cache 1
total 2
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:49:12 2024]
localrule cache:
output: cached.txt
jobid: 1
reason: Missing output files: cached.txt
resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5
Moving output file cached.txt to cache.
Symlinking output file cached.txt from cache.
Completion of job ['cache'] reported to scheduler.
jobs registered as running before removal {cache}
[Mon Mar 4 10:49:12 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:49:12 2024]
localrule all:
input: cached.txt
output: all.txt
jobid: 0
reason: Missing output files: all.txt; Input files updated by another job: cached.txt
resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar 4 10:49:12 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T104912.299719.snakemake.log
unlocking
removing lock
removing lock
removed all locks
The cache:
ll snakecache
-rw-rw-rw-. 1 jlanga jlanga 0 Mar 4 10:49 d5281ea3cbea5533581d362bf3375f0d61f076708f54dee5360dba560de33b1b
cached.txt is a symlink to the cached file.
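For reference, the on-disk layout described above can be reproduced by hand, without SLURM or Snakemake. This is only an illustrative sketch: the hash below is copied verbatim from the listing, not recomputed, and the script mimics what a cache hit looks like on disk rather than using any real Snakemake API.

```python
# Recreate the observed cache layout: the cache directory holds a file
# named by the rule's provenance hash, and the workdir output is a
# symlink pointing at it.
from pathlib import Path

HASH = "d5281ea3cbea5533581d362bf3375f0d61f076708f54dee5360dba560de33b1b"

cache = Path("snakecache")
cache.mkdir(exist_ok=True)
(cache / HASH).touch()                      # the cached artifact

out = Path("cached.txt")
if out.exists() or out.is_symlink():
    out.unlink()
out.symlink_to((cache / HASH).resolve())    # the output is just a link
print(out.is_symlink())                     # prints True
```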
Delete the output files to test caching again:
rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: b6bf9837-c0fb-4031-9d67-6c1a61327ff4
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job count
----- -------
cache 1
total 1
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:52:41 2024]
rule cache:
output: cached.txt
jobid: 0
reason: Missing output files: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name b6bf9837-c0fb-4031-9d67-6c1a61327ff4 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.18vdpw72' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env input mtime params code --conda-frontend 'mamba' --shared-fs-usage source-cache software-deployment input-output persistence sources storage-local-copies --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13902541 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13902541.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0949852466583252 seconds
The output is:
'13902541|COMPLETED
'
status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13902541'}
active_jobs_seen_by_sacct are: {'13902541'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0942690372467041 seconds
The output is:
'13902541|COMPLETED
'
status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar 4 10:53:41 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-03-04T105241.639558.snakemake.log
unlocking
removing lock
removing lock
removed all locks
This time cached.txt is an actual file, not a symlink:
ll
-rw-r--r--. 1 jlanga jlanga 0 Mar 4 10:53 cached.txt
drwxrwsrwx. 2 jlanga jlanga 2882 Mar 4 10:49 snakecache
-rw-r--r--. 1 jlanga jlanga 276 Mar 4 10:23 Snakefile
drwxr-sr-x. 2 jlanga jlanga 0 Mar 4 10:18 test
Thanks!
Also running into this issue and would be interested in a solution 👍
I was looking into this a bit:
To my understanding of the code, the issue is that the --cache argument does not get passed to the snakemake command written to the jobfile.
This may not be a slurm executor issue, but rather a general issue of snakemake not passing the --cache argument to the exec_job command:
Following the code trail:
The Executor class of snakemake-executor-plugin-slurm uses its parent class's method format_job_exec to get the snakemake command to run: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/c414f6571d5de30e834162ca987a5cdf81f869c5/snakemake_executor_plugin_slurm/__init__.py#L196
This is defined in the RealExecutor abstract class:
https://github.com/snakemake/snakemake-interface-executor-plugins/blob/56b8c84a18b3b8b6a0497201cf17defbbaffa425/snakemake_interface_executor_plugins/executors/real.py#L141-L177
Here the job command is built from several building blocks, one of which should, I think, add the --cache argument.
To me it looks like the appropriate place would be general_args:
https://github.com/snakemake/snakemake-interface-executor-plugins/blob/56b8c84a18b3b8b6a0497201cf17defbbaffa425/snakemake_interface_executor_plugins/executors/real.py#L148-L150
This in turn is based on self.workflow.spawned_job_args_factory.general_args of the snakemake package, which prepares the job arguments:
https://github.com/snakemake/snakemake/blob/2ec61f489785b19529665ada3f9d1811c47dbbcd/snakemake/spawn_jobs.py#L227
To my understanding, this may be an appropriate place to make sure the --cache argument is passed.
I made a branch that adds this argument: https://github.com/snakemake/snakemake/compare/main...FroeseLab:snakemake:feature/cacheinjobs
and this indeed makes caching work with this executor: with the fix, the executor submits a job that then uses the cache. Would this be a good solution to the issue, or would it be more appropriate to solve it at the plugin level?
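The shape of such a fix can be sketched as follows. This is illustrative only: append_cache_arg and its argument names are hypothetical, not the actual snakemake spawn_jobs API; the real change lives in SpawnedJobArgsFactory.general_args.

```python
from typing import Iterable, List

def append_cache_arg(args: List[str], cache_rules: Iterable[str]) -> List[str]:
    """Forward the rules given to --cache on the main invocation to the
    argument list of the spawned (remote) snakemake process."""
    rules = list(cache_rules)
    if rules:
        return args + ["--cache", *rules]
    return args  # caching not requested: leave the command unchanged

# A spawned command built from e.g. ["--force", "--nolock"] for a
# workflow run with `--cache cache` gains the missing flag:
print(append_cache_arg(["--force", "--nolock"], ["cache"]))
# -> ['--force', '--nolock', '--cache', 'cache']
```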
Arguably, an additional optimization would be to check the cache before submitting a job at all (i.e., if the rule's output is already cached, don't submit a job but link the results directly).
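That optimization could look roughly like this. A hedged sketch only: fetch_from_cache and the hash-named layout mirror the cache listing earlier in the thread, but this is not the real snakemake OutputFileCache API.

```python
from pathlib import Path

def fetch_from_cache(cache_dir: str, provenance_hash: str, output: str) -> bool:
    """Return True (and symlink the output) on a cache hit, so the caller
    can skip the cluster submission entirely; False means submit as usual."""
    entry = Path(cache_dir) / provenance_hash
    out = Path(output)
    if entry.exists():
        if out.exists() or out.is_symlink():
            out.unlink()                 # replace any stale output
        out.symlink_to(entry.resolve())  # link result straight from cache
        return True
    return False  # cache miss: fall through to sbatch
```

On a hit this avoids the scheduler round trip entirely, which matters most for short jobs like the touch rule in the example above.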
I made a branch that adds this argument: ...
That would be great. Please open a PR at the snakemake main repository.
It works! Thanks!!! Great work!!!
It would be an awesome feature to recycle past computations via --cache. I have tried using the slurm executor together with --cache, but the latter is completely ignored, even when the results are already in the cache.