Open jlanga opened 4 months ago
It would be an awesome feature ...
Oh yes! And a but otherwise.
However, can you tell us: Is the cache directory available upon submit time and the same within job context?
If so, can you please prepare a minimal example and run with --verbose --cache ...
?
I'll be on a conference the next few days - will try to look into this issue, yet it might take a bit longer.
Sure!
This is the example snakefile:
rule cache:
output: "cached.txt"
cache: True
resources:
mem_mb=500,
runtime=5,
shell: "touch {output}"
rule all:
input: "cached.txt"
output: "all.txt"
resources:
mem_mb=500,
runtime=5
shell: "touch {output}"
mkdir --parents snakecache
chmod ugo+rwx snakecache
export SNAKEMAKE_OUTPUT_CACHE="$(pwd)/snakecache"
snakemake --version
8.5.3
snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: 249f4012-3b8d-43fd-b354-bfd1af4719e1
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job count
----- -------
all 1
cache 1
total 2
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:25:57 2024]
rule cache:
output: cached.txt
jobid: 1
reason: Missing output files: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 1 has been submitted with SLURM jobid 13901748 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13901748.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09672927856445312 seconds
The output is:
'13901748|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901748'}
active_jobs_seen_by_sacct are: {'13901748'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.09421706199645996 seconds
The output is:
'13901748|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar 4 10:26:57 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:26:57 2024]
rule all:
input: cached.txt
output: all.txt
jobid: 0
reason: Missing output files: all.txt; Input files updated by another job: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
sbatch call: sbatch --job-name 249f4012-3b8d-43fd-b354-bfd1af4719e1 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/%j.log --export=ALL --comment all -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'all:' --allowed-rules 'all' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.otgwkxwl' 'cached.txt' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env code params input mtime --conda-frontend 'mamba' --shared-fs-usage sources input-output storage-local-copies source-cache persistence software-deployment --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13901777 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_all/13901777.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name 249f4012-3b8d-43fd-b354-bfd1af4719e1
It took: 0.12538862228393555 seconds
The output is:
'13901748|COMPLETED
13901777|COMPLETED
'
status_of_jobs after sacct is: {'13901748': 'COMPLETED', '13901777': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13901777'}
active_jobs_seen_by_sacct are: {'13901777'}
missing_sacct_status are: set()
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar 4 10:27:08 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T102556.662489.snakemake.log
unlocking
removing lock
removing lock
removed all locks
The job finishes, but the cache is empty:
ls -l snakecache
total 0
rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --verbose --latency-wait 60 all
Assuming unrestricted shared filesystem usage for local execution.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
----- -------
all 1
cache 1
total 2
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:49:12 2024]
localrule cache:
output: cached.txt
jobid: 1
reason: Missing output files: cached.txt
resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5
Moving output file cached.txt to cache.
Symlinking output file cached.txt from cache.
Completion of job ['cache'] reported to scheduler.
jobs registered as running before removal {cache}
[Mon Mar 4 10:49:12 2024]
Finished job 1.
1 of 2 steps (50%) done
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:49:12 2024]
localrule all:
input: cached.txt
output: all.txt
jobid: 0
reason: Missing output files: all.txt; Input files updated by another job: cached.txt
resources: tmpdir=/tmp, mem_mb=1024, mem_mib=977, runtime=5
Completion of job ['all'] reported to scheduler.
jobs registered as running before removal {all}
[Mon Mar 4 10:49:12 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-03-04T104912.299719.snakemake.log
unlocking
removing lock
removing lock
removed all locks
The cache:
ll snakecache
-rw-rw-rw-. 1 jlanga jlanga 0 Mar 4 10:49 d5281ea3cbea5533581d362bf3375f0d61f076708f54dee5360dba560de33b1b
cached.txt
is a symlink to the cached file.
Delete the output files to test caching again:
rm -fv all.txt cached.txt
removed 'all.txt'
removed 'cached.txt'
snakemake --cache cache -j 1 -c 1 --executor slurm --verbose --latency-wait 60
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: b6bf9837-c0fb-4031-9d67-6c1a61327ff4
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job count
----- -------
cache 1
total 1
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...
[Mon Mar 4 10:52:41 2024]
rule cache:
output: cached.txt
jobid: 0
reason: Missing output files: cached.txt
resources: tmpdir=<TBD>, mem_mb=1024, mem_mib=977, runtime=5
No SLURM account given, trying to guess.
Guessed SLURM account: mjolnir
sbatch call: sbatch --job-name b6bf9837-c0fb-4031-9d67-6c1a61327ff4 --output /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/%j.log --export=ALL --comment cache -A mjolnir -p cpuqueue -t 5 --mem 1024 --cpus-per-task=1 -D /maps/projects/mjolnir1/people/jlanga --wrap="/home/jlanga/.miniconda3/envs/snake8/bin/python3.12 -m snakemake --snakefile '/maps/projects/mjolnir1/people/jlanga/Snakefile' --target-jobs 'cache:' --allowed-rules 'cache' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1024' 'mem_mib=977' --wait-for-files '/maps/projects/mjolnir1/people/jlanga/.snakemake/tmp.18vdpw72' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose --rerun-triggers software-env input mtime params code --conda-frontend 'mamba' --shared-fs-usage source-cache software-deployment input-output persistence sources storage-local-copies --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 60 --scheduler 'ilp' --scheduler-solver-path '/home/jlanga/.miniconda3/envs/snake8/bin' --default-resources 'tmpdir=system_tmpdir' --executor slurm-jobstep --jobs 1 --mode 'remote'"
Job 0 has been submitted with SLURM jobid 13902541 (log: /maps/projects/mjolnir1/people/jlanga/.snakemake/slurm_logs/rule_cache/13902541.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0949852466583252 seconds
The output is:
'13902541|COMPLETED
'
status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: {'13902541'}
active_jobs_seen_by_sacct are: {'13902541'}
missing_sacct_status are: set()
Completion of job ['cache'] reported to scheduler.
Waiting at most 60 seconds for missing files.
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-03-02T10:00 --endtime now --name b6bf9837-c0fb-4031-9d67-6c1a61327ff4
It took: 0.0942690372467041 seconds
The output is:
'13902541|COMPLETED
'
status_of_jobs after sacct is: {'13902541': 'COMPLETED'}
active_jobs_ids_with_current_sacct_status are: set()
active_jobs_seen_by_sacct are: set()
missing_sacct_status are: set()
jobs registered as running before removal {cache}
[Mon Mar 4 10:53:41 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-03-04T105241.639558.snakemake.log
unlocking
removing lock
removing lock
removed all locks
This time cached.txt
is an actual file, not a symlink:
ll
-rw-r--r--. 1 jlanga jlanga 0 Mar 4 10:53 cached.txt
drwxrwsrwx. 2 jlanga jlanga 2882 Mar 4 10:49 snakecache
-rw-r--r--. 1 jlanga jlanga 276 Mar 4 10:23 Snakefile
drwxr-sr-x. 2 jlanga jlanga 0 Mar 4 10:18 test
Thanks!
Also running into this issue and would be interested in a solution 👍
It would be an awesome feature to recycle past computations via
--cache
.I have tried using together the slurm executor and --cache, but the latter is completely ignored, even when the results are already in the cache.