Process terminated for an unknown reason (SLURM)

Description of the bug

When running the atac-seq pipeline on our SLURM cluster, it keeps failing at seemingly arbitrary points, with an error message saying that the process was "terminated for an unknown reason -- Likely it has been terminated by the external system" (see full error below).

When I resume the pipeline without changing any parameters, it usually gets past the previously terminated process and then fails again at a later step with the same error message. If I keep resuming, the pipeline eventually reaches the end.

When a process fails, the working directory contains only two files:

.command.sh
.command.run

None of the other expected files (.out, .trace, .exitcode, ...) are present, and no symlinks to the input data have been created. If I manually submit the .command.run script to the cluster, without making any changes, it succeeds without any problem and all the files are there.
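To make the state of a failed task directory easier to report, here is a minimal sketch of the check I do by hand. The function name `check_task_dir` is hypothetical, and the list of file names is an assumption about what a completed Nextflow task normally leaves behind; it may not match every Nextflow version or executor.

```shell
#!/bin/bash
# Report which of the usual Nextflow task files exist in a task work directory.
# ASSUMPTION: the file list below reflects what a completed task typically
# leaves behind; adjust it if your Nextflow version writes different files.
check_task_dir() {
    dir="$1"
    for f in .command.sh .command.run .command.begin .command.out \
             .command.err .command.log .command.trace .exitcode; do
        if [ -e "$dir/$f" ]; then
            echo "present  $f"
        else
            echo "missing  $f"
        fi
    done
}
```

Calling `check_task_dir "$task_dir"`, with `task_dir` set to the work directory of the failed task, prints one line per file. For the failed tasks described above, only .command.sh and .command.run would be reported as present.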

I have been in touch with the IT support team that manages the cluster, but they also do not know what is happening. We used to have a Sun Grid Engine cluster, on which the pipeline ran without problems. The issue started to appear when the cluster was migrated to SLURM.

Command used and terminal output

#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

module load java/x86_64/16.0.1+9
module load nextflow/x86_64/23.04.1

nextflow -c atac-seq-slurm.config run nf-core/atacseq \
         -profile singularity \
         -params-file atac-seq.yaml \
         --save_align_intermeds \
         -resume

ERROR ~ Error executing process > 'NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)'

Caused by:
  Process `NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  samtools \
      index \
      -@ 1 \
       \
      CONTROL_REP1.mLb.mkD.sorted.bam

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX":
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Relevant files

Nextflow log file: nextflow.log
SLURM output: slurm.cyclone8.45837.txt

The config file only sets the working directory and the SLURM executor:

workDir = '/scratch/...'

executor {
    name = 'slurm'
}

The parameter file contains these settings:

input: './samplesheet.csv'
fasta: 'data/ath.fasta'
gff: 'data/ath.gff'
outdir: './results'

aligner: bowtie2
macs_gsize: 119481543
narrow_peak: true

max_cpus: 24
max_memory: '100.GB'

System information

Nextflow version: 23.04.1
Hardware: HPC
Executor: SLURM
Container engine: Singularity
OS: Linux
Version of nf-core/atacseq: 2.1.2
