Closed — Dictionary2b closed this issue 4 months ago
Hi Zongzhuang, sorry for your issues with this. Could you provide your config, resource config, and command line used to run snparcher? The create_cov_bed step is pretty memory intensive so that could be the issue with regards to your posted error log. As for slow progress, that could be a number of things outside of snpArcher's control, such as HPC queue and resource limits. However, please post those things above, and we can try to diagnose.
@cademirch Thanks for your suggestion! Here are the config, resources, and the bash script I used to run snparcher (.sh). They are all archived in this zip file.
By asking UPPMAX support, I got the suggestion to request a fat-node partition job with at least 512 GB (-C 512 GB) instead of using the core partition as I did, which has at most 128 GB. Is this something I can change in the cluster-config file? Besides, since there are only a few fat nodes on the cluster, is it possible to make the workflow submit only the memory-intensive jobs to the node partition and keep the others with the previous settings? I'm also unsure whether I need to kill the current process and restart it to apply the changes.
Hi Zongzhuang,
I took a look and your configs look OK to me. One thing I will suggest is using the `--slurm` option when executing Snakemake instead of the profile. See these docs for more details: https://snakemake.readthedocs.io/en/v7.32.3/executing/cluster.html#executing-on-slurm-clusters
As for submitting certain rules to specific partitions, this is possible; the docs above detail how. I would suggest creating a YAML profile in which you define which rules go to which partition. Here's an example:
uppmax_example_profile.yaml

```yaml
slurm: True      # same as `--slurm` on the command line
jobs: 1000       # number of jobs to run concurrently
use-conda: True
# other wanted command line options can be set here
default-resources:
  slurm_partition: <Your partition name here>  # default partition for all rules
  slurm_account: <Your slurm account>          # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here>  # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern
```
Then when you run snparcher you can do so with this profile. Let me know if this helps!
Hi Cade,
Many thanks for your explanation!
I'm unsure whether using the `--slurm` option without the `--profile` option will work in this case. The cpus-per-task issue on the SLURM system still seems to persist; it is now partially solved by an edit to profiles/slurm/slurm_utils.py.
If I do still have to use the `--profile` option, could I modify profiles/slurm/config.yaml (or something in cluster_config.yml?) to submit certain rules to specific partitions?
Okay, sorry, I didn't realize that issue also. The shell script you sent above looks like this:
```shell
❯ cat run_pipeline_zongz1123.sh
#!/bin/bash
#SBATCH -A naiss2023-5-278
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 10-00:00:00
#SBATCH -J snpArcher
#SBATCH -e snpArcher_%A_%a.err  # File to which STDERR will be written
#SBATCH -o snpArcher_%A_%a.out
#SBATCH --mail-type=all
#SBATCH --mail-user=dictionary2b@gmail.com
module load conda/latest
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
mamba activate snparcher
snakemake --snakefile workflow/Snakefile --profile ./profiles/slurm
```
You would edit the profile config in ./profiles/slurm to include this:
```yaml
default-resources:
  slurm_partition: <Your partition name here>  # default partition for all rules
  slurm_account: <Your slurm account>          # if applicable on your cluster
set-resources:
  create_cov_bed:
    slurm_partition: <Your big partition name here>  # overrides the default partition for this rule
  # ... you can specify partitions for other rules by following this pattern
```
Let me know if this makes sense and is helpful!
Thanks, Cade. ./profiles/slurm is a directory containing both config.yaml and cluster_config.yml. In this case, I can't find the right place to put the code you suggested. To my understanding, if I want to add a specific resource setting for certain jobs, I would need to add it to cluster_config.yml to make it look like this:
```yaml
__default__:
  partition: "snowy"
  time: 7-00:00:00
  partition: core
  ntasks: 8
  output: "logs/slurm/slurm-%j.out"
  account: naiss2023-5-278
create_cov_bed:
  partition: "snowy"
  time: 7-00:00:00
  partition: node
  nodes: 1
  ntasks: 8
  constraint: mem512GB
  output: "logs/slurm/slurm-%j.out"
  account: naiss2023-5-278
```
Do I understand you correctly? Sorry for the misunderstanding!
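As a side note on the semantics of such a cluster config: the per-rule entry overrides keys in `__default__`, which can be sketched as a plain dict merge (illustrative Python, not the profile's actual submit-script code; the option values are taken from the example above):

```python
# Illustrative sketch of cluster_config.yml override semantics:
# keys in the per-rule entry win over __default__ when both define a key.
default_opts = {"partition": "core", "time": "7-00:00:00", "ntasks": 8}
rule_opts = {"partition": "node", "nodes": 1, "constraint": "mem512GB"}

merged = {**default_opts, **rule_opts}  # rule-specific keys override defaults
print(merged["partition"])   # node (overridden by the rule entry)
print(merged["time"])        # 7-00:00:00 (inherited from __default__)
```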
Ah, my apologies, I messed up what I posted. I believe what you have is correct. It's a bit confusing between the two main ways to run SLURM with Snakemake. I will look more into the issue you posted above as well, now that I have a SLURM cluster to test on.
Hello!
Just to follow up on this discussion: how is memory determined for this rule? Is it modifiable and capable of being run with lower memory? Looking at the `create_cov_bed` rule, I don't see any `resources` section.
The discussion above could be a potential solution, but when we run this rule it tries to request a ton of memory (1,700 GB) that may not exist on any node of our computing cluster (error message below saying "Requested node configuration is not available").
If it's useful information, we're using low-depth human samples (~325) mapped to the hg38 genome, which is quite complete.
Sincerely, Brian
```
[Wed Feb 7 13:22:58 2024]
rule create_cov_bed:
    input: results/hg38/summary_stats/all_cov_sumstats.txt, results/hg38/callable_sites/all_samples.d4
    output: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
    jobid: 1634
    benchmark: benchmarks/hg38/covbed/past_and_turk_benchmark.txt
    reason: Missing output files: results/hg38/callable_sites/past_and_turk_callable_sites_cov.bed
    wildcards: refGenome=hg38, prefix=past_and_turk
    resources: mem_mb=1743686, mem_mib=1662909, disk_mb=1743686, disk_mib=1662909, tmpdir=
```
We have seen this a number of times - the default memory specification seems to go off the rails for a reason we don't yet understand.
One solution is to just define `mem_mb` as some other reasonable number directly in the rule's `resources` section.
We are hoping to debug this but so far haven't tracked down the problem.
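A hedged sketch of that fix (the rule body is abbreviated with `...`, and the 64000 value is just an illustrative cap, not a tested recommendation):

```python
rule create_cov_bed:
    input:
        ...
    output:
        ...
    resources:
        mem_mb = 64000  # fixed request, instead of Snakemake's input-size-based default
    shell:
        ...
```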
I think this may be happening because `create_cov_bed` is not defined in the resources YAML, so Snakemake falls back to a default:
https://github.com/snakemake/snakemake/blob/0998cc57cbd02c38d1a3bbf1662c8b23b7601e20/snakemake/resources.py#L11-L16
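The linked default is `mem_mb=max(2 * input.size_mb, 1000)` (and the same formula for `disk_mb`), so for a rule with no `resources` section, memory scales with input size. A small Python sketch of that formula, using the numbers from the log above:

```python
# Snakemake's fallback resource formula (see snakemake/resources.py):
#   mem_mb = max(2 * input.size_mb, 1000)
def default_mem_mb(input_size_mb: float) -> int:
    """Memory (MB) Snakemake requests for a rule with no explicit resources."""
    return max(int(2 * input_size_mb), 1000)

# Tiny inputs get the 1000 MB floor:
print(default_mem_mb(3))        # -> 1000
# The mem_mb=1743686 in the log above implies ~872 GB of combined input
# (mostly the all_samples.d4 file):
print(default_mem_mb(871843))   # -> 1743686
```

This is why the request grows with the size of the d4 file rather than with what the tool actually needs.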
Hello, this issue has caused failures at the qc module in 2 of 3 otherwise successful runs. I have defined

```python
resources:
    mem_mb = 16000
```

in the workflow/modules/qc/Snakefile, before the `run` or `shell` directive in every rule, but the error and job failure persist. Snakefile and err file attached.
As you say, Cade, the first error is for create_cov_bed, which is not a rule in this Snakefile. err336.txt Snakefile.txt
Thanks, Cade. The workflow has now finished appropriately. Defining the specific resource allocation for each job in cluster_config.yml, as I did, is the solution. : )
Hello, I was running the workflow for an extensive data set (over 800 samples) on a SLURM platform (UPPMAX). I used the GATK approach with intervals. I got an error message like this:

```
rule create_cov_bed:
    input: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
    output: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed
    jobid: 2558
    benchmark: benchmarks/GCA_009792885.1/covbed/lark20231207_benchmark.txt
    reason: Missing output files: results/GCA_009792885.1/callable_sites/lark20231207_callable_sites_cov.bed; Input files updated by another job: results/GCA_009792885.1/summary_stats/all_cov_sumstats.txt, results/GCA_009792885.1/callable_sites/all_samples.d4
    wildcards: refGenome=GCA_009792885.1, prefix=lark20231207
    resources: mem_mb=448200, mem_mib=427437, disk_mb=448200, disk_mib=427437, tmpdir=
```
```
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/./profiles/slurm/slurm-submit.py", line 59, in <module>
    print(slurm_utils.submit_job(jobscript, *sbatch_options))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 131, in submit_job
    raise e
  File "/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/profiles/slurm/slurm_utils.py", line 129, in submit_job
    res = subprocess.check_output(["sbatch"] + optsbatch_options + [jobscript])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 466, in check_output
    return run(popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sw/apps/conda/latest/rackham/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--partition=core', '--time=7-00:00:00', '--ntasks=8', '--output=logs/slurm/slurm-%j.out', '--account=naiss2023-5-278', '--mem=448200', '/crex/proj/uppstore2019097/nobackup/zongzhuang_larks_working/Final_variant_callinglarks/snpArcher/.snakemake/tmp.fi5cgg1q/snakejob.create_cov_bed.2558.sh']' returned non-zero exit status 1.
Error submitting jobscript (exit code 1):
Select jobs to execute...
```
There is not enough memory on the SLURM system. I'm not sure where the issue lies: whether I didn't ask for a large enough memory allocation in the settings, or the HPC simply cannot provide that much memory. Should I perhaps change the source code to request a larger memory allocation for the workflow from the beginning (e.g. more than 1000 nodes)?
Besides, the whole workflow is not running very fast either; the bam2vcf jobs report 2% progress every 24 hours, which will obviously exceed the time limit of the Snakemake SLURM job. Could you please give some suggestions on this? Thanks in advance.
Best, Zongzhuang