harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
74 stars 33 forks

default resources runtime issue #227

Open elizakirsch0 opened 2 weeks ago

elizakirsch0 commented 2 weeks ago

Hello,

Sorry to open up another issue! I am struggling with continuous slurm "TIMEOUT" errors despite setting my runtime to 10080 minutes (7 days, the maximum length for the partition on the cluster I'm using). My bwa-mem jobs will fail after 24 hours or so:

Error in rule bwa_map:
    message: SLURM-job '26605857' failed, SLURM status is: 'TIMEOUT'.

When I look at the .log file, this is the error I see:

slurmstepd: error: *** STEP 26605857.0 ON a02-10 CANCELLED AT 2024-10-24T05:31:53 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 26605857 ON a02-10 CANCELLED AT 2024-10-24T05:31:53 DUE TO TIME LIMIT ***

I am confused why the jobs are being cancelled due to time limit, because I have both the default resources and the specific resources for the bwa-mem rule set to 10080 minutes:

# These resources will be applied to all rules. Can be overridden on a per-rule basis below.
default-resources:
  mem_mb: attempt * 2000
  mem_mb_reduced: (attempt * 2000) * 0.9 # Mem allocated to java for GATK rules (tries to prevent OOM errors)
  slurm_partition: "largemem"
  slurm_account: "sedmands_1143"  #Same as sbatch -A. Not all clusters use this.
  runtime: 10080 # In minutes 
#   # Alignment  
#   bwa_map:
#     mem_mb: 256000
#     slurm_partition: "largemem"
#     runtime: 10080
#     cpus_per_task:

Could these settings be getting overridden by snparcher somehow? Also this might be an issue more related to my cluster, so I understand if you can't help with fixing it.

Thank you!

cademirch commented 2 weeks ago

No worries. I think the issue is that you need to uncomment the lines, so it should look like this:

#   # Alignment  
   bwa_map:
     mem_mb: 256000
     slurm_partition: "largemem"
     runtime: 10080
#     cpus_per_task:

Sorry this wasn't clear in the docs. Will try to remedy that. Thanks for your patience!!

cademirch commented 2 weeks ago

Oop, just saw you did have the runtime set high as the default too. That is weird. Let's see if doing the above fixes this. If not, we can look more.

Edit: What command do you use to execute snakemake?

elizakirsch0 commented 2 weeks ago

Ah okay, I will try uncommenting those lines, thank you!

This is my command to execute snakemake:

#!/bin/bash
#SBATCH -J sm                        # Job name
#SBATCH -o snpArcher_out_%j.txt      # Output file with Job ID
#SBATCH -e snpArcher_err_%j.txt      # Error file with Job ID
#SBATCH -p largemem                  # Partition
#SBATCH -n 8                         # Number of tasks (cores)
#SBATCH -t 10080                     # Time limit in minutes (7 days)
#SBATCH --mem=256G                   # Memory allocation
# Purge any loaded modules
module purge
# Load conda environment
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate snparcher  # Activate the snparcher environment
# Unlock the directory
snakemake --snakefile workflow/Snakefile --profile /project/sedmands_1143/ekirsch/V2_savannahsparrow/snp_archer/snpArcher/profiles/slurm --unlock --slurm-partition largemem
# Rerun incomplete jobs and specify largemem as default partition
snakemake --snakefile workflow/Snakefile \
  --profile /project/sedmands_1143/ekirsch/V2_savannahsparrow/snp_archer/snpArcher/profiles/slurm \
  --default-resources slurm_partition=largemem \
  --jobs 3 \
  --latency-wait 120 \
  --rerun-incomplete

elizakirsch0 commented 1 week ago

Hi Cade,

I tried editing my config file and uncommenting the lines. This is the edited chunk of code:

#   # Alignment  
   bwa_map:
     mem_mb: 256000
     slurm_partition: "largemem"
     runtime: 10080

However, now I am getting this error when I try to execute snakemake:

snakemake: error: Couldn't parse config file: while parsing a block mapping
  in "/project/sedmands_1143/ekirsch/V2_savannahsparrow/snp_archer/snpArcher/profiles/slurm/config.yaml", line 18, column 3
expected <block end>, but found '<block mapping start>'
  in "/project/sedmands_1143/ekirsch/V2_savannahsparrow/snp_archer/snpArcher/profiles/slurm/config.yaml", line 158, column 4

Line 158 is the bwa_map line from the chunk above, so it seems like the new formatting is causing an issue.

Not sure what's going on with line 18, but this is what that looks like (line 18 is the second line of this chunk):

# Reference Genome Processing. Does NOT use more than 1 thread.
  download_reference: 1
  index_reference: 1

cademirch commented 1 week ago

Hi @elizakirsch0, make sure the set-resources line is uncommented:

Before:

# Control other resources used by each rule.
# set-resources:

After:

# Control other resources used by each rule.
set-resources:
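
Putting both fixes together, that section of the profile config should look something like this (a sketch using the values from your config above; the exact indent width doesn't matter as long as `bwa_map` and its settings are consistently nested under `set-resources`):

```yaml
# Control other resources used by each rule.
set-resources:
  # Alignment
  bwa_map:
    mem_mb: 256000
    slurm_partition: "largemem"
    runtime: 10080   # in minutes (7 days)
```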

elizakirsch0 commented 1 week ago

That fixed it, thank you!
