nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Support for Java checkpoint mechanism #4291

Open pditommaso opened 1 year ago

pditommaso commented 1 year ago

Modern Java VMs provide a checkpoint mechanism that can interrupt a running VM, persist its execution state to a file, and later resume execution from the persisted state.

This could allow Nextflow pipelines to run on preemptible cloud VMs: when the system reclaims the VM, the current Nextflow execution can be stopped and restarted on a new VM.

Currently, this feature requires the use of a vendor-specific SDK, such as the Zulu Java CRaC SDK.

The goal of this issue is to explore whether this capability can be supported in the Nextflow runtime.

Useful links

pditommaso commented 1 year ago

First exploration:

  1. Install the Zulu SDK using the command sdk install java 17.0.8.crac-zulu (only Linux is supported).
  2. The -enable-checkpoint flag was added to the launch.sh script; see ac4cc10e.
  3. Launch a Nextflow run using the command

    NXF_CLOUDCACHE_PATH=s3://nextflow-ci/cache ./launch.sh -enable-checkpoint run pditommaso/nf-sleep --timeout 300

  4. Find the PID of the Nextflow process.
  5. Create the snapshot file using the command jcmd <PID> JDK.checkpoint

The following error is reported:

An exception during a checkpoint operation:
jdk.internal.crac.CheckpointException
    at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
    at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
    at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
    Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /home/paolo/nextflow/.nextflow.log
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 49424 remoteAddr ::ffff:140.82.121.4 remotePort 443
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 35156 remoteAddr ::ffff:52.92.17.218 remotePort 443
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[14369651]
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[14369652]
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
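
Steps 4 and 5 above can also be scripted. The following is a minimal sketch, not part of Nextflow: the class name `CheckpointTrigger` and the default marker string `nextflow` are assumptions made for illustration, and the only command taken from this thread is `jcmd <PID> JDK.checkpoint`. It looks for a process whose command line contains the marker and then runs `jcmd` against it.

```java
import java.io.IOException;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Nextflow or of the CRaC SDK.
public class CheckpointTrigger {

    /** Builds the command from step 5: jcmd <PID> JDK.checkpoint */
    static List<String> checkpointCommand(long pid) {
        return List.of("jcmd", Long.toString(pid), "JDK.checkpoint");
    }

    /** Step 4: first visible process (other than this one) whose command line contains the marker. */
    static Optional<Long> findPid(String marker) {
        long self = ProcessHandle.current().pid();
        return ProcessHandle.allProcesses()
                .filter(p -> p.pid() != self)
                .filter(p -> p.info().commandLine()
                        .map(c -> c.contains(marker))
                        .orElse(false))
                .map(ProcessHandle::pid)
                .findFirst();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // "nextflow" is an assumed marker; pass a more specific one as the first argument.
        String marker = args.length > 0 ? args[0] : "nextflow";
        Optional<Long> pid = findPid(marker);
        if (pid.isEmpty()) {
            System.err.println("No process matching '" + marker + "' found");
            return;
        }
        // jcmd must be on the PATH and belong to the same CRaC-enabled JDK.
        Process jcmd = new ProcessBuilder(checkpointCommand(pid.get()))
                .inheritIO()
                .start();
        System.out.println("jcmd exited with code " + jcmd.waitFor());
    }
}
```

The marker can match more than one process (for example the launcher shell as well as the JVM), so in practice it is safer to pass a more specific string or to check the PID by hand before taking the checkpoint.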

pditommaso commented 1 year ago

Made some progress here by preventing the CheckpointOpenFileException for the log file via 0fe19b22.

Still, the following error is reported:

An exception during a checkpoint operation:
jdk.internal.crac.CheckpointException
    at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
    at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
    at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
    Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 49744 remoteAddr ::ffff:3.5.71.249 remotePort 443
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[22631398]
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
    Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[22631399]
        at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
        at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
        ... 2 more
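
The general idea behind avoiding a CheckpointOpenFileException is that no file handle may be open at the moment the checkpoint is taken. The sketch below illustrates that idea for a log file; it is not necessarily what commit 0fe19b22 does. The class `CheckpointAwareLog` is hypothetical. Its `beforeCheckpoint` and `afterRestore` methods are named after the callbacks of the CRaC `Resource` interface, but here they are plain methods so that the example compiles on any JDK; in a real integration they would be invoked from a registered CRaC resource.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical log writer that can release its file handle around a checkpoint.
public class CheckpointAwareLog {
    private final Path file;
    private BufferedWriter writer; // null while closed for a checkpoint

    public CheckpointAwareLog(Path file) throws IOException {
        this.file = file;
        this.writer = open();
    }

    // Always open in append mode so earlier log lines survive a restore.
    private BufferedWriter open() throws IOException {
        return Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public synchronized void log(String line) throws IOException {
        if (writer == null) {
            throw new IllegalStateException("log is closed for checkpoint");
        }
        writer.write(line);
        writer.newLine();
        writer.flush();
    }

    /** Close the file so that no descriptor is open during the checkpoint. */
    public synchronized void beforeCheckpoint() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }

    /** Reopen the same file once the process has been restored. */
    public synchronized void afterRestore() throws IOException {
        if (writer == null) {
            writer = open();
        }
    }

    public synchronized boolean isOpen() {
        return writer != null;
    }
}
```

Log calls made between `beforeCheckpoint` and `afterRestore` fail fast in this sketch; a real logger would more likely buffer or drop them.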

pditommaso commented 1 year ago

It looks like the S3 client keeps a socket connection open, causing the error above.
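
One possible way around an open client connection is to close the client before the checkpoint and recreate it lazily on next use. The sketch below shows that pattern with a hypothetical generic wrapper, `ReleasableClient`; it does not use the AWS SDK, and whether closing the S3 client used by Nextflow actually releases all of its pooled sockets is an assumption that has not been verified here.

```java
import java.util.function.Supplier;

// Hypothetical wrapper: holds a closeable client and can drop it before a checkpoint.
public class ReleasableClient<T extends AutoCloseable> {
    private final Supplier<T> factory;
    private T client; // created on first use, null after release

    public ReleasableClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the current client, creating a new one if it was released. */
    public synchronized T get() {
        if (client == null) {
            client = factory.get();
        }
        return client;
    }

    /** Close the client so that its connections are not open during the checkpoint. */
    public synchronized void beforeCheckpoint() throws Exception {
        if (client != null) {
            client.close();
            client = null;
        }
    }

    public synchronized boolean isConnected() {
        return client != null;
    }
}
```

No explicit restore hook is needed with this design, because the first call to `get()` after a restore builds a fresh client. Callers must not keep their own reference to the old client across a checkpoint.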

emyr666 commented 10 months ago

Have you looked at DMTCP?