Open · oliverdrechsel opened this issue 1 month ago
Some questions:

- `scratch` directive for local scratch storage? (see the sketch below)

I'm wondering if the failure happened with the process script or with the copying of task outputs. Possibly related to #3711.
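For context, a minimal sketch of what that directive looks like (the process name and values here are illustrative, not taken from the pipeline in question):

```groovy
// Illustrative only. Per-process form:
process EXAMPLE {
    scratch true        // stage and run the task in node-local scratch storage
    // scratch '/tmp'   // or point it at a specific local path

    script:
    """
    echo "running in \$PWD"
    """
}

// Equivalent global setting in nextflow.config:
//   process.scratch = true
```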
Hi @bentsherman

The `scratch` directive and `.command.log` are given in the bug report. Do you mean this output?
$ sjob 716xxxx
JobID : 716xxxx
Job name : nf-ASSEMBLY_FILTER_count_trimmed_reads_(readcount_CK_47_030432_T_S)
State : OUT_OF_MEMORY
Reason | ExitCode : None | 0:125
Priority : 1721
SubmitLine : sbatch .command.run
WorkDir : /scratch/xxxx/188
Start Time : 2024-09-23 06:16:07
End Time : 2024-09-23 06:17:57
UserID : dxxxx
Account : xxx
Partition : main
Requested TRES : billing=1,cpu=1,gres/local=10,mem=5G,node=1
Nodelist : hpc-node03
TIME requested : 01:00:00
TIME elapsed : 00:01:50
TIME request efficiency: 3% [ 00:01:50 / 01:00:00 ]
TIME overbook : 31x [ 01:00:00 / 00:01:50 - 1 ]
MEM requested : 5G
MEM max RSS : 5G
MEM request efficiency : 91% [ 4,575,784K / 5G ]
MEM overbook : 9% [ 5G / 4,575,784K - 1 ]
CPUs requested : 1
CPUs allocated : 2 [ number of threads filled up to complete cores ]
CPU total time usage : 00:02:17
CPU load average : 1.245 [ 00:02:17 / 00:01:50 ]
CPU request efficiency : 124% [ 00:02:17 / 00:01:50 / 1 ]
CPU alloc. efficiency : 62% [ 00:02:17 / 00:01:50 / 2 ]
Disk Read Max : 9,771.89M
Disk Write Max : 8,834.29M
TRES Usage IN max : cpu=00:02:18,energy=0,fs/disk=10246572489,mem=4575784K,pages=0,vmem=4649776K
TRES Usage OUT max : energy=0,fs/disk=9263426962
[ locale settings LC_NUMERIC="en_US.UTF-8": decimal_point="." | thousands_sep="," ]
I doubt that this failure is linked to the output step, because it happens well before that. As far as I can tell, the job is killed by Slurm while it is running, and the output is generated regardless, ignoring the kill.
Bug report
Expected behavior and actual behavior
Slurm jobs that run out of memory get OOM-killed. In nearly all cases this works as expected. In an awk process I run, however, there is excessive RAM usage that is only logged in `.command.log` but ignored by the Nextflow process. This ends the awk processes prematurely and leads to corrupted output.
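One way such a mismatch can arise (a sketch, assuming the awk command runs in the middle of a shell pipeline; the actual process script is not shown here): Nextflow's default task shell is `bash -ue` without `pipefail`, so if awk is killed mid-pipeline, the exit status of the script is that of the last command in the pipe, and the task is reported as successful with truncated output.

```bash
# Illustrative pipeline only -- not the actual process script.
# Under `bash -ue` (no pipefail) the pipeline's exit status is that of the
# last command, so an OOM-killed awk can go unnoticed.
zcat sample.fastq.gz \
  | awk 'NR % 4 == 2 { n++ } END { print n }' \
  | tee readcount.txt        # tee exits 0 even if awk was killed upstream
echo "exit status seen by the wrapper: $?"

# Adding `set -o pipefail` at the top of the script block (or changing the
# `shell` setting in the config) would make the kill visible as a failure.
```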
Steps to reproduce the problem
The following code produces issues with fastq.gz files containing 20 million reads or more.
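A rough, hypothetical illustration of the shape of such a process (the name, resources, and awk program are assumptions based on the job info above, not the actual code):

```groovy
// Hypothetical sketch -- not the actual failing process.
process count_trimmed_reads {
    memory '5 GB'
    cpus 1
    time '1h'

    input:
    tuple val(sample), path(reads)          // trimmed fastq.gz with >= 20 million reads

    output:
    tuple val(sample), path("${sample}.readcount.txt")

    script:
    """
    zcat ${reads} | awk 'END { print NR / 4 }' > ${sample}.readcount.txt
    """
}
```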
Program output
In the nextflow.log the jobs look as if they are successful.
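As far as I understand, the grid executors take the task status from the `.exitcode` file that the `.command.run` wrapper writes into the task directory, so a hypothetical way to confirm the mismatch (paths truncated as in the job info above):

```bash
# Hypothetical check only -- paths are truncated as in the report above.
cat /scratch/xxxx/188/.exitcode                                   # presumably contains 0
grep -iE 'killed|out of memory' /scratch/xxxx/188/.command.log    # the OOM message only shows up here
```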
Environment
Additional context