Open pditommaso opened 1 year ago
First exploration:
- `sdk install java 17.0.8.crac-zulu` (only Linux is supported)
- `-enable-checkpoint` flag was added to the `launch.sh` script; see ac4cc10e
- launch a nextflow run using the command `NXF_CLOUDCACHE_PATH=s3://nextflow-ci/cache ./launch.sh -enable-checkpoint run pditommaso/nf-sleep --timeout 300`
- `jcmd <PID> JDK.checkpoint`
- the following error is reported
An exception during a checkpoint operation:
jdk.internal.crac.CheckpointException
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /home/paolo/nextflow/.nextflow.log
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 49424 remoteAddr ::ffff:140.82.121.4 remotePort 443
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 35156 remoteAddr ::ffff:52.92.17.218 remotePort 443
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[14369651]
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[14369652]
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Made some progress here by preventing the CheckpointOpenFileException for the log file via 0fe19b22.
Still, the following error is reported:
An exception during a checkpoint operation:
jdk.internal.crac.CheckpointException
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
Suppressed: jdk.internal.crac.impl.CheckpointOpenSocketException: tcp6 localAddr ::ffff:172.31.37.61 localPort 49744 remoteAddr ::ffff:3.5.71.249 remotePort 443
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:91)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[22631398]
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
Suppressed: jdk.internal.crac.impl.CheckpointOpenResourceException: pipe:[22631399]
at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:97)
at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
... 2 more
It looks like the S3 client keeps a socket connection open, which causes the CheckpointOpenSocketException above.
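One possible direction, sketched here only to illustrate the idea: release the socket before the checkpoint and reconnect after the restore. The `CheckpointAware` interface and `ManagedConnection` class below are hypothetical stand-ins so the example runs on a stock JDK. With the CRaC API the equivalent hooks would be an `org.crac.Resource` registered via `Core.getGlobalContext().register(...)`. Whether the S3 client used by Nextflow can be closed and rebuilt this way is an assumption that still needs to be verified.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CheckpointSocketSketch {

    /** Hypothetical stand-in for the two CRaC lifecycle callbacks. */
    interface CheckpointAware {
        void beforeCheckpoint() throws IOException;
        void afterRestore() throws IOException;
    }

    /** A connection that is dropped before a checkpoint and re-established afterwards. */
    static class ManagedConnection implements CheckpointAware {
        private final InetAddress host;
        private final int port;
        private Socket socket;

        ManagedConnection(InetAddress host, int port) throws IOException {
            this.host = host;
            this.port = port;
            this.socket = new Socket(host, port);
        }

        boolean isOpen() {
            return socket != null && !socket.isClosed();
        }

        @Override
        public void beforeCheckpoint() throws IOException {
            // Close the socket so that no open descriptor is left at checkpoint time.
            if (socket != null) {
                socket.close();
                socket = null;
            }
        }

        @Override
        public void afterRestore() throws IOException {
            // Reconnect once the JVM has been restored.
            if (socket == null) {
                socket = new Socket(host, port);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A local listener stands in for the remote endpoint (an assumption for the demo).
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            ManagedConnection conn =
                new ManagedConnection(server.getInetAddress(), server.getLocalPort());
            System.out.println("open before checkpoint: " + conn.isOpen());
            conn.beforeCheckpoint();
            System.out.println("open at checkpoint: " + conn.isOpen());
            conn.afterRestore();
            System.out.println("open after restore: " + conn.isOpen());
            conn.beforeCheckpoint(); // release the socket before exiting
        }
    }
}
```

The demo only simulates the callback order by calling the two methods by hand. In a real run the JVM would invoke them around `jcmd <PID> JDK.checkpoint`.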
Have you looked at DMTCP?
Modern Java VMs provide a checkpoint mechanism that makes it possible to interrupt a running VM, persist its execution state to a file, and later continue the execution from the persisted state.
This can be useful for running Nextflow pipelines on preemptible cloud VMs: when the system reclaims the VM, the current Nextflow execution can be stopped and restarted on a new VM.
Currently, this feature requires the use of a vendor-specific SDK, such as the Zulu Java CRaC SDK.
The goal of this issue is to explore the possibility of supporting this capability in the Nextflow runtime.
Useful links