jdblischak / smk-simple-slurm

A simple Snakemake profile for Slurm without --cluster-config

Issues with allocated memory #11

Closed Angel-Popa closed 1 year ago

Angel-Popa commented 1 year ago

Hi,

I am noticing that all my rules request memory twice: once at a lower maximum than what I requested (mem_mb) and then at what I actually requested (mem_gb). If I run the rules as localrules they run faster. How can I make sure the default settings do not interfere?

resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/$USER.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00

The rules are as follows:

rule bwa_mem2_mem:
    input:
        R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
        R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz", 
        R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
        R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
        idx= "data/results/genome/genome",
        ref = "data/results/genome/genome.fa"
    output:
        bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
    log:
        bwa ="logs/bwa_mem2/{species}.{population}.{individual}.log",
        sam ="logs/samtools_view/{species}.{population}.{individual}.log",
    benchmark:
        "benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv",
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_mb = parameters["bwa_mem2"]["mem_gb"],        
    params:
        extra = parameters["bwa_mem2"]["extra"],
        tag = compose_rg_tag,
    threads:
        parameters["bwa_mem2"]["threads"]
    shell:
        "bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} 2> {log.bwa} | "
        "samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -@ {threads} /dev/stdin 2> {log.sam}"

and the config is:

cluster:
  mkdir -p logs/{rule} && # change the log file to logs/slurm/{rule}
  sbatch
    --partition={resources.partition}
    --time={resources.time}
    --qos={resources.qos}
    --cpus-per-task={threads}
    --mem={resources.mem_gb}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
    --parsable # Required to pass job IDs to scancel
default-resources:
  - partition=h24
  - qos=normal
  - mem_gb=100
  - time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True # Required to run with local conda environment
cluster-status: status-sacct.sh # Required to monitor the status of the submitted jobs
cluster-cancel: scancel # Required to cancel the jobs with Ctrl + C 
cluster-cancel-nargs: 50

Cheers, Angel

jdblischak commented 1 year ago

It looks like you are using both mem_mb and mem_gb, is that right? I'd recommend sticking to just one or the other. I tend to use mem_mb because the --mem argument to sbatch assumes the value is in MB when no unit is given, which lets me keep it as a plain integer.

But I could give better advice if I knew what you want to accomplish. Do you want to specify the memory for your jobs in MB or GB? The actual name used in the Snakefile and config.yaml is not important; it just has to be consistent between the two files.

Assuming you want to use GB, I would recommend the following 2 steps:

1. In the Snakefile (and in default-resources in config.yaml), define mem_gb as a plain integer, e.g. mem_gb=100.
2. In the sbatch call in config.yaml, append the unit yourself: --mem={resources.mem_gb}G.

The main advantage of this method is that you can do math on the integer values in the Snakefile (e.g. increase the memory for retries based on the variable attempt).
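A minimal sketch of those two steps, reusing the 100 GB default already used in this thread (illustrative only, not copied from the linked example):

# Snakefile: memory as a plain integer number of GB
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_gb = parameters["bwa_mem2"]["mem_gb"],  # integer, e.g. 100

# config.yaml: append the G suffix so sbatch interprets the value as GB
cluster:
  sbatch
    --mem={resources.mem_gb}G
default-resources:
  - mem_gb=100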

jdblischak commented 1 year ago

one at a lower maximum than what I requested (mem_mb) and then what I actually requested (mem_gb)

Reading over this again, one issue is clearly that sbatch is being passed --mem={resources.mem_gb} in config.yaml, but the rule bwa_mem2_mem only defines mem_mb:

resources:
        time = parameters["bwa_mem2"]["time"],
        mem_mb = parameters["bwa_mem2"]["mem_gb"],    

Thus sbatch is passed the default value of mem_gb=100 (which gets interpreted by sbatch as 100 MB). If you changed a single m (mem_mb) to g (mem_gb), then it would use the value of parameters["bwa_mem2"]["mem_gb"]:

resources:
        time = parameters["bwa_mem2"]["time"],
        mem_gb = parameters["bwa_mem2"]["mem_gb"],    

However, that would only work if parameters["bwa_mem2"]["mem_gb"] is a string that ends in G. If it is only a number, then sbatch will interpret that as MB.
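Put differently: if you keep --mem={resources.mem_gb} in config.yaml without a suffix, the value itself has to carry the unit. A hypothetical parameters file entry (not taken from your setup) would look like:

bwa_mem2:
  mem_gb: "100G"   # string with unit; sbatch receives --mem=100G

With a bare integer (mem_gb: 100) you instead need the suffix in the sbatch call itself, i.e. --mem={resources.mem_gb}G.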

jdblischak commented 1 year ago

@Angel-Popa I created an example to demonstrate how to specify memory in GB. Please check it out at https://github.com/jdblischak/smk-simple-slurm/tree/main/examples/mem-gb
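The example also demonstrates doing math on the integer value. A rough sketch of what an attempt-based rule can look like (names and numbers here are illustrative; see the linked example for the actual code):

rule dynamic_resources:
    output:
        "output/dynamic-resources.txt"
    resources:
        mem_gb = lambda wildcards, attempt: attempt * 5  # 5 GB, 10 GB, ... per retry
    shell:
        "echo requested {resources.mem_gb}G > {output}"

Combined with --mem={resources.mem_gb}G in config.yaml, each retry requests more memory from Slurm.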

Angel-Popa commented 1 year ago

Hi John,

So I made the changes as you suggested and I keep getting the same issue. I also get it with the example you linked above.

[Wed Sep 28 10:51:33 2022]
rule dynamic_resources:
    output: output/dynamic-resources.txt
    jobid: 3
    reason: Missing output files: output/dynamic-resources.txt
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp/$USER.54835, mem_gb=5

jdblischak commented 1 year ago

I also get it with the example you linked above.

You can ignore mem_mb=1000. It is one of Snakemake's standard resources, thus it will always be set. I updated the documentation and the Snakefile for my example mem-gb to make it more obvious that Slurm is requesting the correct amount of memory. It runs sacct so you can directly observe the memory allocation for each job.

Seeing the unused mem_mb default printed next to your mem_gb value is a clear downside of specifying memory in GB instead of MB. However, you can rest assured that sbatch is submitting jobs with the correct memory allocation.
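If you want to check the allocation yourself outside of the example, sacct can report the requested memory for any submitted job (replace <jobid> with the ID printed by sbatch):

sacct -j <jobid> --format=JobID,JobName,ReqMem,State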

jdblischak commented 1 year ago

xref: https://github.com/Snakemake-Profiles/slurm/issues/89