nf-core / atacseq

ATAC-seq peak-calling and QC analysis pipeline
https://nf-co.re/atacseq
MIT License

Process terminated for an unknown reason (SLURM) #339

Open hdbeukel opened 10 months ago

hdbeukel commented 10 months ago

Description of the bug

When running the atac-seq pipeline on our SLURM cluster, it keeps failing at seemingly arbitrary points, with an error message saying that the process was "terminated for an unknown reason -- Likely it has been terminated by the external system" (see full error below).

When I resume the pipeline without changing any parameters, it usually gets past the previously terminated process and then fails at a later step with the same error message. If I keep resuming, it eventually reaches the end.

When a process fails, the working directory contains only two files:

There are no .out, .trace, .exitcode, ... files, and no symlinks to the input data have been created. If I manually submit the .command.run script to the cluster, without making any changes, it succeeds without any problem and all the files are there.

I have been in touch with our IT support in charge of managing the cluster but they also have no clue what is happening. We used to have a Sun Grid Engine cluster, on which the pipeline ran without problems. The issue started to appear when the cluster was migrated to SLURM.

Command used and terminal output

#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

module load java/x86_64/16.0.1+9
module load nextflow/x86_64/23.04.1

nextflow -c atac-seq-slurm.config run nf-core/atacseq \
         -profile singularity \
         -params-file atac-seq.yaml \
         --save_align_intermeds \
         -resume

ERROR ~ Error executing process > 'NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)'

Caused by:
  Process `NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  samtools \
      index \
      -@ 1 \
       \
      CONTROL_REP1.mLb.mkD.sorted.bam

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX":
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Relevant files

The config file only sets the working directory and the SLURM executor:

workDir = '/scratch/...'

executor {
    name = 'slurm'
}

The parameter file contains these settings:

input: './samplesheet.csv'
fasta: 'data/ath.fasta'
gff: 'data/ath.gff'
outdir: './results'

aligner: bowtie2
macs_gsize: 119481543
narrow_peak: true

max_cpus: 24
max_memory: '100.GB'

System information

JoseEspinosa commented 2 months ago

Sorry for the late reply @hdbeukel, but this seems to be a memory issue. From your nextflow.log:

# There is insufficient memory for the Java Runtime Environment to continue.
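To check whether other runs hit the same condition, the log can be searched for that message. This is only a minimal sketch: `find_jvm_oom` is a hypothetical helper name, and the default path `.nextflow.log` assumes the log sits in the directory the pipeline was launched from.

```shell
# find_jvm_oom [LOGFILE]: print, with line numbers, every log line where the
# JVM reports that it ran out of memory.
# Assumption: LOGFILE defaults to .nextflow.log in the launch directory.
find_jvm_oom() {
    grep -n -F "insufficient memory for the Java Runtime Environment" "${1:-.nextflow.log}"
}
```

Calling `find_jvm_oom` with no argument searches the default log; passing a path searches an older or rotated log instead.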

You could try increasing the memory for the process with a custom.config file:

process {
    withName: PICARD_MARKDUPLICATES {
        memory = 72.GB
    }
}

Then add -c custom.config to your Nextflow command.
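Put together, the suggestion above could look roughly like the sketch below. The file names and options are taken from the original report. Passing `-c` twice so that the existing SLURM config stays in place is an assumption about how you want to combine the two files, and 72.GB is only a starting value to adjust to what your nodes offer.

```shell
# Write the memory override into custom.config.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > custom.config <<'EOF'
process {
    withName: PICARD_MARKDUPLICATES {
        memory = 72.GB
    }
}
EOF

# Then launch on the cluster with both config files (left commented out here
# because it needs Nextflow and the pipeline inputs to be available):
# nextflow -c atac-seq-slurm.config -c custom.config run nf-core/atacseq \
#          -profile singularity \
#          -params-file atac-seq.yaml \
#          --save_align_intermeds \
#          -resume
```

Alternatively, the same process block could be appended to atac-seq-slurm.config, so the command itself does not change.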