uclahs-cds / pipeline-call-gSNP

Nextflow pipeline to call germline short variants using GATK
https://uclahs-cds.github.io/pipeline-call-gSNP/
GNU General Public License v2.0

Error `paired_sample_wgs:reheader_interval_bams` #50

Open tgebo opened 2 years ago

tgebo commented 2 years ago

**Changed the parameter from the previous run in #49 back to its default value: `scatter_count = 50`**

Error executing process > 'paired_sample_wgs:reheader_interval_bams:run_BuildBamIndex_Picard_normal (24)'

Caused by:
  Process `paired_sample_wgs:reheader_interval_bams:run_BuildBamIndex_Picard_normal (24)` terminated with an error exit status (134)

Command executed:

  set -euo pipefail
  java -Xmx1024m -Djava.io.tmpdir=/scratch         -jar /usr/local/share/picard-slim-2.26.8-0/picard.jar BuildBamIndex         -VALIDATION_STRINGENCY LENIENT         -INPUT DTB-005_DNA_N_recalibrated_reheadered_24.bam         -OUTPUT DTB-005_DNA_N_recalibrated_reheadered_24.bam.bai

Command exit status:
  134

Command output:
  (empty)

Command error:
  03:32:33.080 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/picard-slim-2.26.8-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
  [Wed Jan 05 03:32:33 GMT 2022] BuildBamIndex --INPUT DTB-005_DNA_N_recalibrated_reheadered_24.bam --OUTPUT DTB-005_DNA_N_recalibrated_reheadered_24.bam.bai --VALIDATION_STRINGENCY LENIENT --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
  [Wed Jan 05 03:32:33 GMT 2022] Executing as ?@9ba14d8de151 on Linux 3.10.0-1127.19.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.26.8
  runtime/cgo: pthread_create failed: Resource temporarily unavailable
  .command.run: line 273: 70764 Aborted                 docker run -i --cpus 1.0 --memory 1024m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /scratch:/scratch -v "$PWD":"$PWD" -w "$PWD" --entrypoint /bin/bash -u $(id -u):$(id -g) $(for i in `id --real --groups`; do echo -n "--group-add=$i "; done) --volume /scratch:/scratch --name $NXF_BOXID blcdsdockerregistry/picard:2.26.8 -c "/bin/bash .command.run nxf_trace"
tyamaguchi-ucla commented 2 years ago

@yashpatel6 we can add some comments about the fix for the record.

I think the root cause may be related to max user processes, which is 4096 by default. I had a similar issue before with hatchet (OpenBLAS; I had to add some extra environment variables to adjust the number of threads). If we see this issue with different tools and want to increase the ulimit, we'll have to ask OHIA, or we may need to adjust the number of intervals/jobs running at the same time.

See max user processes below.

(base) [tyamaguchi@ip-0A12521D CN_20]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15068
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 3145728
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
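The exit status 134 (SIGABRT) together with `pthread_create failed: Resource temporarily unavailable` in the log is consistent with hitting this per-user process/thread limit: each scattered Picard job is a JVM that spawns many threads, and they all count against the same user. A rough back-of-the-envelope check (the threads-per-JVM figure below is an assumed ballpark, not a measurement):

```shell
# Per-user process/thread limit that pthread_create runs into.
nproc_limit=$(ulimit -u)
echo "max user processes: ${nproc_limit}"

# With scatter_count=50, the scattered BuildBamIndex/ApplyBQSR JVMs run
# concurrently; assume ~32 threads per JVM (GC, compiler, compression, etc.).
scatter_count=50
threads_per_jvm=32   # assumed ballpark, not measured
estimated=$(( scatter_count * threads_per_jvm ))
echo "estimated peak threads from scattered JVMs: ${estimated}"

# If the estimate approaches the limit, either lower scatter_count or raise
# the container limit, e.g. via Docker's --ulimit flag (subject to the host's
# hard limit):
#   docker run --ulimit nproc=16384 ...
```

This is only a sketch to frame the numbers; the actual thread count per JVM would need to be measured on the node.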
yashpatel6 commented 2 years ago

> @yashpatel6 we can add some comments about the fix for the record.
>
> I think the root cause may be related to max user processes, which is 4096 by default. If we see this issue with different tools and want to increase the ulimit, we'll have to ask OHIA, or we may need to adjust the number of intervals/jobs running at the same time.

Got it. I think part of the reason is also the scratch space running out, since ApplyBQSR is parallelized and the pipeline has to wait for both Indel Realignment and BQSR to complete before deleting files. I've tried lowering the number of split intervals, but the disk-space issue causes the pipeline to fail. Once I add the fix for processing the normal and tumour BQSR together, I'll test it again and see whether the same issue pops up.
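The disk pressure described above can be sketched as follows: until cleanup runs, each interval's realigned BAM and its recalibrated copy coexist on `/scratch`, so peak usage is roughly double the per-stage footprint. The per-interval size below is an assumed placeholder, not a measured value:

```shell
# Rough peak-usage estimate for /scratch (all sizes are assumptions).
per_interval_gb=4    # assumed size of one realigned interval BAM
scatter_count=50

# Two copies per interval (pre- and post-BQSR) until both stages finish
# and intermediate files are deleted:
peak_gb=$(( per_interval_gb * scatter_count * 2 ))
echo "estimated peak scratch usage: ${peak_gb} GB"

# Compare against what the work directory's filesystem actually has free
# (guarded, since /scratch may not exist on the machine running this sketch):
df -h /scratch 2>/dev/null || true
```

Under these assumptions, halving the number of intervals only helps if cleanup also runs earlier; otherwise the doubled footprint dominates, which matches the observation that lowering the split count alone still failed.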