nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Slurm oom-kill due to memory is ignored. #5332

Open · oliverdrechsel opened this issue 2 days ago

oliverdrechsel commented 2 days ago

Bug report

Expected behavior and actual behavior

Slurm jobs that run out of memory get oom-killed, and in nearly all cases this failure is detected correctly. In an awk step I run, however, excessive RAM usage triggers an oom-kill that is only logged in .command.log and otherwise ignored by the Nextflow process. The awk pipeline ends prematurely and produces corrupted output, while the task is still treated as successful.

Steps to reproduce the problem

The following process reproduces the issue with fastq.gz files containing 20 million reads or more.

process count_reads {

    label "count_reads"

    publishDir path: "${params.analysesdir}/${stage}/${sample}/", pattern: "*.csv", mode: "copy"

    // SLURM cluster options
    cpus 1
    memory "5 GB"
    time "1h"

    tag "readcount_${sample}"

    input:
        tuple val(sample), path(reads)
        val(stage)

    output:
        tuple val(sample), path("${sample}_read_count.csv"), emit: read_count

    script:
        """
            zless ${reads[0]} | awk 'END {printf "%s", "${sample},"; printf "%.0f", NR/4; print ""}' > ${sample}_read_count.csv

        """

    stub:
        """
            mkdir -p ${sample}
            touch ${sample}_read_count.csv
        """
}
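One detail about that script (an observation, not a confirmed cause): the read count runs as a pipeline, and without pipefail a pipeline's exit status is the status of its last command. If the oom-kill hits the decompression side (zless) rather than awk, awk simply sees an early end of input, writes a truncated count and exits 0, so .command.sh can still finish with exit status 0. A minimal shell sketch of that behaviour:

$ bash -c 'kill -KILL $$' | cat; echo $?     # prints 0: the killed producer is invisible
$ set -o pipefail
$ bash -c 'kill -KILL $$' | cat; echo $?     # prints 137: the kill now propagates

Adding set -o pipefail at the top of the script block would make such a kill fail the task instead of silently producing a corrupted CSV.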

Program output

In the nextflow.log the jobs look as if they were successful; only .command.log in the task work directory records the oom-kill:

$ cat .command.log
slurmstepd-hpc-...: error: Detected 1 oom_kill event in StepId=71xxxxx.batch. Some of the step tasks have been OOM Killed.
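The same job's state in Slurm accounting can be checked with sacct (a sketch, assuming accounting is enabled; the job id is redacted above):

$ sacct -j <jobid> --format=JobID,JobName%40,State,ExitCode,MaxRSS,ReqMem

Slurm accounting reports the job as OUT_OF_MEMORY, even though nextflow.log treats the task as successful.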

Environment

Additional context


bentsherman commented 11 hours ago

Some questions: I'm wondering whether the failure happened in the process script itself or during the copying of task outputs. Possibly related to #3711.
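One way to narrow that down (a sketch; the work directory path is a placeholder) would be to look at what Nextflow recorded for the task script itself, independent of output staging:

$ cd <task-work-dir>      # placeholder for the task's work directory
$ cat .exitcode           # exit status Nextflow recorded for .command.sh
$ tail .command.err       # stderr of the task script

If .exitcode contains 0, the script itself exited cleanly despite the oom-kill, which would point away from the output-copying step.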

oliverdrechsel commented 11 hours ago

Hi @bentsherman

Do you mean this output?

$ sjob 716xxxx

JobID                  : 716xxxx
Job name               : nf-ASSEMBLY_FILTER_count_trimmed_reads_(readcount_CK_47_030432_T_S)
State                  : OUT_OF_MEMORY
Reason | ExitCode      : None | 0:125
Priority               : 1721

SubmitLine             : sbatch .command.run
WorkDir                : /scratch/xxxx/188

Start Time             : 2024-09-23 06:16:07
End Time               : 2024-09-23 06:17:57

UserID                 : dxxxx
Account                : xxx
Partition              : main

Requested TRES         : billing=1,cpu=1,gres/local=10,mem=5G,node=1
Nodelist               : hpc-node03

TIME requested         :            01:00:00
TIME elapsed           :            00:01:50
TIME request efficiency:                   3%    [ 00:01:50 / 01:00:00 ]
TIME overbook          :                  31x    [ 01:00:00 / 00:01:50 - 1 ]

MEM requested          :                   5G
MEM max RSS            :                   5G
MEM request efficiency :                  91%    [ 4,575,784K / 5G ]
MEM overbook           :                   9%    [ 5G / 4,575,784K - 1 ]

CPUs requested         :                   1
CPUs allocated         :                   2     [ number of threads filled up to complete cores ]
CPU total time usage   :            00:02:17
CPU load average       :                   1.245 [ 00:02:17 / 00:01:50 ]
CPU request efficiency :                 124%    [ 00:02:17 / 00:01:50 / 1 ]
CPU alloc.  efficiency :                  62%    [ 00:02:17 / 00:01:50 / 2 ]

Disk Read Max          :            9,771.89M
Disk Write Max         :            8,834.29M

TRES Usage IN max      : cpu=00:02:18,energy=0,fs/disk=10246572489,mem=4575784K,pages=0,vmem=4649776K
TRES Usage OUT max     : energy=0,fs/disk=9263426962

[ locale settings LC_NUMERIC="en_US.UTF-8": decimal_point="." | thousands_sep="," ]

I doubt this failure is linked to the output-copying step, because it happens well before that. As far as I can tell, the job is oom-killed by Slurm while the script is still running, yet the output file is written anyway, ignoring the kill.
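If the kill really does land inside the pipeline before awk, a possible workaround sketch (untested on this pipeline) would be to run the task scripts with pipefail and retry oom-killed attempts with more memory:

// nextflow.config - workaround sketch only
process {
    // run .command.sh with pipefail so a killed upstream process in a
    // pipeline produces a non-zero exit status instead of silent truncation
    shell = ['/bin/bash', '-euo', 'pipefail']

    withLabel: 'count_reads' {
        errorStrategy = 'retry'
        maxRetries    = 2
        memory        = { 5.GB * task.attempt }   // escalate memory on each retry
    }
}

That only works around the symptom, though; ideally Nextflow would pick up the OUT_OF_MEMORY state that Slurm reports for the job.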