icbi-lab / nextNEOpi

nextNEOpi: a comprehensive pipeline for computational neoantigen prediction

make_uBAM process error #49

Closed gri11 closed 9 months ago

gri11 commented 10 months ago

At first I was able to run the pipeline with the test data provided in this repo, but when I used my own data (Normal ~12 GB, Tumor ~60 GB) I got this error in the make_uBAM process.

Sep-24 04:18:10.026 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'make_uBAM (sample1 : tumor_DNA)'

Caused by:
  Process `make_uBAM (sample1 : tumor_DNA)` terminated with an error exit status (247)

Command executed:

  mkdir -p /tmp/ubuntu/nextNEOpi
  gatk --java-options "-Xmx64G -XX:ParallelGCThreads=1" FastqToSam \
      --TMP_DIR /tmp/ubuntu/nextNEOpi \
      --MAX_RECORDS_IN_RAM 2097152 \
      -F1 72379_QS2023-002_50GB_1.fq.gz -F2 72379_QS2023-002_50GB_2.fq.gz \
      --READ_GROUP_NAME sample1_tumor \
      --SAMPLE_NAME sample1_tumor \
      --LIBRARY_NAME sample1_tumor \
      --PLATFORM ILLUMINA \
      -O sample1_tumor_DNA_unaligned.bam

Command exit status:
  247

Command output:
  (empty)

Command error:
  INFO  2023-09-24 04:14:37 FastqToSam  Processed     2,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:14:49 FastqToSam  Processed     3,000,000 records.  Elapsed time: 00:00:18s.  Time for last 1,000,000:   12s.  Last read position: */*
  INFO  2023-09-24 04:14:53 FastqToSam  Processed     4,000,000 records.  Elapsed time: 00:00:22s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:15:02 FastqToSam  Processed     5,000,000 records.  Elapsed time: 00:00:31s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:15:05 FastqToSam  Processed     6,000,000 records.  Elapsed time: 00:00:34s.  Time for last 1,000,000:    2s.  Last read position: */*
  INFO  2023-09-24 04:15:14 FastqToSam  Processed     7,000,000 records.  Elapsed time: 00:00:44s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:15:18 FastqToSam  Processed     8,000,000 records.  Elapsed time: 00:00:47s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:15:28 FastqToSam  Processed     9,000,000 records.  Elapsed time: 00:00:57s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:15:30 FastqToSam  Processed    10,000,000 records.  Elapsed time: 00:00:59s.  Time for last 1,000,000:    2s.  Last read position: */*
  INFO  2023-09-24 04:15:40 FastqToSam  Processed    11,000,000 records.  Elapsed time: 00:01:09s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:15:43 FastqToSam  Processed    12,000,000 records.  Elapsed time: 00:01:12s.  Time for last 1,000,000:    2s.  Last read position: */*
  INFO  2023-09-24 04:15:53 FastqToSam  Processed    13,000,000 records.  Elapsed time: 00:01:22s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:15:56 FastqToSam  Processed    14,000,000 records.  Elapsed time: 00:01:25s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:16:05 FastqToSam  Processed    15,000,000 records.  Elapsed time: 00:01:35s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:16:09 FastqToSam  Processed    16,000,000 records.  Elapsed time: 00:01:38s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:16:18 FastqToSam  Processed    17,000,000 records.  Elapsed time: 00:01:47s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:16:21 FastqToSam  Processed    18,000,000 records.  Elapsed time: 00:01:50s.  Time for last 1,000,000:    2s.  Last read position: */*
  INFO  2023-09-24 04:16:30 FastqToSam  Processed    19,000,000 records.  Elapsed time: 00:02:00s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:16:34 FastqToSam  Processed    20,000,000 records.  Elapsed time: 00:02:03s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:16:43 FastqToSam  Processed    21,000,000 records.  Elapsed time: 00:02:13s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:16:47 FastqToSam  Processed    22,000,000 records.  Elapsed time: 00:02:17s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:16:51 FastqToSam  Processed    23,000,000 records.  Elapsed time: 00:02:20s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:17:01 FastqToSam  Processed    24,000,000 records.  Elapsed time: 00:02:30s.  Time for last 1,000,000:   10s.  Last read position: */*
  INFO  2023-09-24 04:17:05 FastqToSam  Processed    25,000,000 records.  Elapsed time: 00:02:34s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:17:14 FastqToSam  Processed    26,000,000 records.  Elapsed time: 00:02:43s.  Time for last 1,000,000:    9s.  Last read position: */*
  INFO  2023-09-24 04:17:18 FastqToSam  Processed    27,000,000 records.  Elapsed time: 00:02:47s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:17:30 FastqToSam  Processed    28,000,000 records.  Elapsed time: 00:02:59s.  Time for last 1,000,000:   11s.  Last read position: */*
  INFO  2023-09-24 04:17:33 FastqToSam  Processed    29,000,000 records.  Elapsed time: 00:03:02s.  Time for last 1,000,000:    3s.  Last read position: */*
  INFO  2023-09-24 04:17:42 FastqToSam  Processed    30,000,000 records.  Elapsed time: 00:03:11s.  Time for last 1,000,000:    8s.  Last read position: */*
  Using GATK jar /opt/gatk/gatk-package-4.4.0.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx64G -XX:ParallelGCThreads=1 -jar /opt/gatk/gatk-package-4.4.0.0-local.jar FastqToSam --TMP_DIR /tmp/ubuntu/nextNEOpi --MAX_RECORDS_IN_RAM 2097152 -F1 72379_QS2023-002_50GB_1.fq.gz -F2 72379_QS2023-002_50GB_2.fq.gz --READ_GROUP_NAME sample1_tumor --SAMPLE_NAME sample1_tumor --LIBRARY_NAME sample1_tumor --PLATFORM ILLUMINA -O sample1_tumor_DNA_unaligned.bam

Work dir:
  /home/ubuntu/nextNEOpi.1.4.0/work/48/c4158935efd3a22b4a39030ccccc3d

Is there any way to fix this error?

Computer spec:

Config:

withName:make_uBAM {
        beforeScript = "ulimit -n 4096 -m unlimited"
}
// Picard common options
maxRecordsInRam = "2097152"
maxRecordsInRamMerge = "2097152"
riederd commented 10 months ago

You can try to set a lower memory limit for Java in conf/params.config, around line 137:

JAVA_Xmx = "-Xmx32G"

fredsamhaak commented 10 months ago

Hi @riederd, I ran the pipeline on a larger dataset (Normal ~12 GB, Tumor ~15 GB) with the same spec settings as @gri11 and set JAVA_Xmx = "-Xmx32G" as you suggested, but I ran into the same problem: command exit status 247. (Computer spec: 128 GB RAM.)

Running bash .command.run manually in the work dir where the error occurred completes successfully, but when I rerun nextflow run nextNEOpi-master/nextNEOpi.nf ... -resume, it does not get past the make_uBAM step and still runs into the same problem.

Do you have any advice for this? Thanks in advance!

All the best, He

riederd commented 10 months ago

@fredsamhaak Can you post the contents of the work dir as a tar.gz?

Are you running the pipeline on Slurm? If so, are there any memory limits set?

fredsamhaak commented 10 months ago

Hi, I am not running the pipeline on Slurm. By the way, I want to launch the pipeline with nohup *.sh &, but it stops soon after starting (that's another question; maybe we can discuss it later), so for now I simply run it directly to first check whether everything works with a larger dataset.

Here is the work dir: work.tar.gz

The *unaligned.bam file is too large, so I excluded it: (screenshot attached)

-- Update: I tried the pipeline again, this time with nextflow run nextNEOpi-master/nextNEOpi.nf ... -profile singularity -resume (instead of -profile singularity,cluster). While it was running I used top to check the memory used by this step and found that more than ~125 GB was in use; the pipeline then showed the output in the attached screenshot.

I checked work/88/32d24ebff687352f3bb6f61377b315, and the process ran successfully there (screenshots attached).

riederd commented 10 months ago

These are the contents of a work directory in which the process ran successfully. I'd need the one from the run that failed with 247.

Can you post the output of top so that I can see which process is using 125 GB of memory? A single process should not need that much.

fredsamhaak commented 10 months ago

Hi, here is the work dir: work.247.tar.gz

I also attach the work dir for the exit-137 case, since that error also seems to occur when memory usage reaches ~125 GB:

(screenshot attached) work.137.tar.gz

-- And here are the screenshots of the top output: 1: (screenshot attached)

2: (screenshot attached)

riederd commented 10 months ago

Thanks for the info:

Can you try lowering the memory setting for sambamba in conf/params.config, around line 146:

SB_sort_mem = "16G"
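
Assuming both memory knobs sit in the same params scope of conf/params.config (an assumption; the block layout and line numbers may differ between releases), the combined change could look roughly like this:

params {
    // Java heap for GATK/Picard steps such as FastqToSam in make_uBAM
    JAVA_Xmx    = "-Xmx32G"
    // memory budget for sambamba sort; lower it if the node runs low on RAM
    SB_sort_mem = "16G"
}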

fredsamhaak commented 10 months ago

Thanks @riederd, it works (the make_uBAM step finished successfully). But it is a little bit "dangerous", because the used memory nearly reached 125 GB (checked with top). Does the JAVA_Xmx parameter matter here? And is it OK to set JAVA_Xmx = "-Xmx16G"?

I've attached nextNEOpi_trace.txt (I think you may need this file to check the memory used by each step). Could you please explain the last four columns relating to memory? Thank you~ nextNEOpi_trace.txt

All the best, He

riederd commented 9 months ago

Hi, I'm glad it worked.

JAVA_Xmx sets the maximum amount of memory the Java virtual machine (basically the Java process) can consume. You can try to lower it; I guess 16G might work, but it may increase the compute time for some processes.

As for the last four columns of the trace file:

peak_rss: Peak of real memory (Resident Set Size), basically the physical memory actually used by the process.
peak_vmem: Peak of virtual memory (vmem), the total amount of memory used by the process; vmem covers all memory areas, whether they are in physical memory (RAM), in swap space, on disk, or shared with other processes.
rchar: Number of bytes the process read.
wchar: Number of bytes the process wrote.

see also: https://www.nextflow.io/docs/latest/metrics.html#memory-usage https://nextflow.io/docs/edge/tracing.html#trace-report
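
As a side note, if a trace file were ever missing these columns, Nextflow lets you request them explicitly via the trace scope of the configuration; a minimal sketch follows, where the fields list is only an illustrative selection and not nextNEOpi's default:

trace {
    enabled = true
    // explicitly request the memory and I/O columns discussed above
    fields  = 'task_id,name,status,exit,peak_rss,peak_vmem,rchar,wchar'
}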

You can also limit the maximum number of process instances the pipeline runs in parallel, so that the different processes do not consume all of your memory. Use Nextflow's maxForks setting to do so, e.g. in conf/process.config around line 10 add:

maxForks = 1

or whatever number is good for you.

see also: https://www.nextflow.io/docs/latest/process.html#maxforks
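
For completeness, a sketch of such a global limit, assuming conf/process.config wraps its directives in a process scope (an assumption about the file layout):

process {
    // applies to every process unless overridden by a withName selector
    maxForks = 1
}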

If this answers your questions, may I close this issue?

fredsamhaak commented 9 months ago

Thanks for your detailed explanation.

So, from the nextNEOpi_trace.txt I attached before: did the pipeline fail with 247 because the memory used by the four processes framed in the red box exceeded ~128 GB (138.7 = 40.4 + 40.4 + 28.9 + 29)? (screenshot attached)

I will set maxForks = 1 to run the pipeline sequentially and check whether it works (and how long it takes).

riederd commented 9 months ago

Yeah, it looks like it, but note that you need to sum up the peak_rss values, and I guess there are also other processes running on the system that need some memory.

I think setting maxForks = 1 globally will prevent running out of memory on your system; however, it is not needed for all the processes in the pipeline. You can fine-tune this in conf/process.config and limit only the processes that consume a large amount of memory, e.g.:

....
withName:Bwa {
    maxForks = 1
}

withName:Neofuse {
    maxForks = 1
}
....

and, as discussed above, lowering the memory settings for Java and sambamba in conf/params.config should help as well, and is probably sufficient, e.g.:

JAVA_Xmx = "-Xmx16G"
SB_sort_mem = "16G"
fredsamhaak commented 9 months ago

That's great! I will try setting maxForks = 1 just for Bwa and make_uBAM, and maybe I will also try lowering the memory settings for Java and sambamba. Thanks for your patience and detailed explanation.

Have a nice day! He

fredsamhaak commented 9 months ago

Hi @riederd, sorry for the late response; I just came back from vacation.

Setting maxForks = 1 for Bwa and make_uBAM works for me, but I ran into another problem: the Neofuse step has been running for more than 8 days and is still not finished. Here is the log file: nextflow.log

Thanks again!