icbi-lab / nextNEOpi

nextNEOpi: a comprehensive pipeline for computational neoantigen prediction

Problem with CNNScoreVariants #20

Closed haraldgrove closed 1 year ago

haraldgrove commented 1 year ago

I just tried to run the workflow on a WGS dataset and got an error message from two of the CNNScore tasks (out of 40 total). I tried to verify the problem by running the failed task directly, both with the Singularity image from nextNEOpi and with a local Docker image of GATK, but both runs finished without any errors.

Any idea what might be happening here? Error message:

Error executing process > 'CNNScoreVariants (P01)'

Caused by:
  Process `CNNScoreVariants (P01)` terminated with an error exit status (3)

Command executed:

  mkdir -p /gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/tmp

  gatk CNNScoreVariants \
      --tmp-dir /gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/tmp \
      -R GRCh38.d1.vd1.fa \
      -I P01_normal_DNA_recalibrated.bam \
      -V P01_germline_0013-scattered.interval_list.vcf.gz \
      -tensor-type read_tensor \
      --inter-op-threads 2 \
      --intra-op-threads 2 \
      --transfer-batch-size 256 \
      --inference-batch-size 128 \
      -O P01_germline_0013-scattered.interval_list.vcf_CNNScored.vcf.gz

Command exit status:
  3

Command output:
  (empty)

Command error:
        ... 11 more
  Caused by: org.broadinstitute.hellbender.exceptions.GATKException: Expected message of length 3 but only found 0 bytes
        at org.broadinstitute.hellbender.utils.runtime.StreamingProcessController.getBytesFromStream(StreamingProcessController.java:267)
        at org.broadinstitute.hellbender.utils.runtime.StreamingPro 2.27.1
  03:39:59.647 INFO  CNNScoreVariants - Built for Spark Version: 2.4.5
  03:39:59.647 INFO  CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
  03:39:59.648 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
  03:39:59.648 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
  03:39:59.648 INFO  CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
  03:39:59.648 INFO  CNNScoreVariants - Deflater: IntelDeflater
  03:39:59.649 INFO  CNNScoreVariants - Inflater: IntelInflater
  03:39:59.649 INFO  CNNScoreVariants - GCS max retries/reopens: 20
  03:39:59.649 INFO  CNNScoreVariants - Requester pays: disabled
  03:39:59.649 INFO  CNNScoreVariants - Initializing engine
  03:40:03.900 INFO  FeatureManager - Using codec VCFCodec to read file file://P01_germline_0013-scattered.interval_list.vcf.gz
  03:40:05.154 INFO  CNNScoreVariants - Done initializing engine
  03:40:05.157 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
  03:40:14.233 INFO  CNNScoreVariants - Using key:CNN_2D for CNN architecture:/gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/tmp/small_2d.16228056869151426233.json and weights:/gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/tmp/small_2d.5079293164478269062.hd5
  03:40:14.705 INFO  CNNScoreVariants - Done scoring variants with CNN.
  03:40:14.706 INFO  CNNScoreVariants - Shutting down engine
  [January 26, 2023 at 3:40:14 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants done. Elapsed time: 0.29 minutes.
  Runtime.totalMemory()=3812622336
  org.broadinstitute.hellbender.exceptions.GATKException: Exception waiting for ack from Python: org.broadinstitute.hellbender.exceptions.GATKException: Expected message of length 3 but only found 0 bytes
        at org.broadinstitute.hellbender.utils.runtime.StreamingProcessController.waitForAck(StreamingProcessController.java:239)
        at org.broadinstitute.hellbender.utils.python.StreamingPythonScriptExecutor.waitForAck(StreamingPythonScriptExecutor.java:216)
        at org.broadinstitute.hellbender.utils.python.StreamingPythonScriptExecutor.sendSynchronousCommand(StreamingPythonScriptExecutor.java:183)
        at org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants.initializePythonArgsAndModel(CNNScoreVariants.java:557)
        at org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants.onTraversalStart(CNNScoreVariants.java:317)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1083)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
  Caused by: java.util.concurrent.ExecutionException: org.broadinstitute.hellbender.exceptions.GATKException: Expected message of length 3 but only found 0 bytes
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.broadinstitute.hellbender.utils.runtime.StreamingProcessController.waitForAck(StreamingProcessController.java:234)
        ... 11 more
  Caused by: org.broadinstitute.hellbender.exceptions.GATKException: Expected message of length 3 but only found 0 bytes
        at org.broadinstitute.hellbender.utils.runtime.StreamingProcessController.getBytesFromStream(StreamingProcessController.java:267)
        at org.broadinstitute.hellbender.utils.runtime.StreamingProcessController.lambda$waitForAck$0(StreamingProcessController.java:214)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
  Using GATK jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar CNNScoreVariants --tmp-dir /gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/tmp -R GRCh38.d1.vd1.fa -I P01_normal_DNA_recalibrated.bam -V P01_germline_0013-scattered.interval_list.vcf.gz -tensor-type read_tensor --inter-op-threads 2 --intra-op-threads 2 --transfer-batch-size 256 --inference-batch-size 128 -O P01_germline_0013-scattered.interval_list.vcf_CNNScored.vcf.gz

Work dir:
  /gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/41/58b884e0a616d7e02cc9dd1a5d0d27

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Also, when I tried to rerun the whole job (with -resume and without making any changes to any of the inputs), the process started at the beginning of the DNA alignment step. Since the log indicated that the Mutect2 step had finished, I assumed it would be able to use the BAM files from the previous run?

-Harald

riederd commented 1 year ago

Hi, thanks for your interest in nextNEOpi. Unfortunately, I cannot comment on the GATK error you got. I have not seen it before, and I also couldn't find any useful information online.

What happens if you retry manually by doing:

cd /gnome/harald/2022/neoantigens/analysis_results/nextneopi_WGS_hg38_nextflow/41/58b884e0a616d7e02cc9dd1a5d0d27
bash .command.run

If it completes, the -resume option should work. But sometimes, for whatever reason, Nextflow doesn't resume at the supposedly last successfully finished process.
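A minimal sketch of the manual retry suggested above, assuming the hidden files Nextflow normally writes into each task work directory (`.command.run` and `.exitcode`). The function names `task_exit_status` and `rerun_task` are made up for illustration and are not part of Nextflow or nextNEOpi.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting and re-running a single Nextflow task.

# Print the exit status Nextflow recorded for the task, or "unknown"
# if no .exitcode file exists in the work directory.
task_exit_status() {
    local workdir="$1"
    if [ -f "$workdir/.exitcode" ]; then
        cat "$workdir/.exitcode"
    else
        echo "unknown"
    fi
}

# Re-run the task wrapper inside its work directory and report the result.
# The subshell keeps the caller's current directory unchanged.
rerun_task() {
    local workdir="$1"
    local status=0
    ( cd "$workdir" && bash .command.run ) || status=$?
    echo "rerun exit status: $status"
    return "$status"
}
```

Calling `task_exit_status` and then `rerun_task` with the path printed under "Work dir:" in the error report shows whether the failure reproduces outside the scheduler. A clean rerun after a recorded exit status of 3 would point toward a transient problem rather than a problem with the inputs.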

haraldgrove commented 1 year ago

Hi

When I manually ran bash .command.run in the indicated folder, it finished without any issue. At this point it doesn't feel like a GATK issue, but rather some weird interaction between my data and the nextflow scripts. (As a side note, I have successfully run a WES data set through this part of the pipeline, so I think the install should be ok.)

Unfortunately, when I tried to resume the workflow, it started with the "ScatteredIntervalListToBed, make_uBAM, Bwa" processes, seemingly not recognizing that the previous run had continued beyond that. I'll have to wait and see if it stops at the CNNScore part again.

riederd commented 1 year ago

Hmmm... this gets difficult to debug/reproduce here. Some time ago we hit an issue at the interval list creation, reported by another user who was also analyzing WGS data. We made a hotfix patch that will be included in the next release. I'm not sure if this would also help with the resume; however, it wouldn't hurt:

nextNEOpi_hotfix_20221215.patch.gz

The manual bash .command.run does exactly what Nextflow would do when it runs the process, so I still think it might be something (hopefully transient) with GATK. Maybe you still have the .nextflow.log.[x] from that failed run, so I can check whether I can spot something more there. Also, a gz archive of that directory (if it fails again) would help in finding the root of the issue.

Thanks
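One way to produce the requested archive is sketched below. It assumes the task inputs are staged into the work directory as symlinks (Nextflow's default), so the BAM and reference files are not copied into the archive. The helper name `archive_task_dir` and the output file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pack one Nextflow task work directory for sharing.

archive_task_dir() {
    local workdir="${1%/}"   # strip one trailing slash, if present
    local archive="$2"       # output file, e.g. failed_task.tar.gz
    # -C switches to the parent directory so the archive holds a relative path.
    # Without -h, tar stores symlinks as links and does not copy their targets.
    tar -czf "$archive" -C "$(dirname "$workdir")" "$(basename "$workdir")"
}
```

For example, `archive_task_dir <work dir from the error report> failed_task.tar.gz` would capture `.command.sh`, `.command.err`, `.command.log` and the other hidden files in that directory.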

haraldgrove commented 1 year ago

I managed to get the pipeline to finish by setting scatter_count to 1. However, that didn't help with the resume functionality; it still starts from before the BAM creation, seemingly at the SplitIntervals (SplitIntervals) step. I managed to delete the previous log files, but if I see the error again, I can provide the log file in case you think you can find anything.

riederd commented 1 year ago

3f34b81da8c155e9b0f905a95aaca9c06710bb4b should resolve the resume behavior.

Feel free to open a new ticket in case v1.4.0 of nextNEOpi still fails.