Open dtrudg opened 2 years ago
On the plus side for DMTCP over CRIU - it supports some HPC relevant concepts that CRIU does not, or did not when the comparison was last updated [...]
I think this comparison was focussing on off-the-shelf integration with MPI and other libraries. Since CRIU checkpoints from "outside" the process, there is no direct integration (yet). However, real-life testing has already proven that CRIU can be used to checkpoint multi-node MPI applications: http://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Ahv%3Adiva-11645
It requires coordination by an outside tool (e.g. a batch system), though.
Is your feature request related to a problem? Please describe.
It is not possible to checkpoint and restore running SingularityCE containers using in-built or simple methods.
Describe the solution you'd like
It should be possible to checkpoint and restore batch and interactive containers that are launched with singularity run/shell/exec. The solution should be similar to what is possible with podman and docker, so it is familiar to users working in a mixed OCI and Singularity environment. See: https://criu.org/Podman
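For readers unfamiliar with the podman workflow referenced above, here is a minimal sketch of what it looks like when driven from a script. It only assembles `podman container checkpoint` / `podman container restore` command lines; the helper names (`checkpoint_cmd`, `restore_cmd`, `run`), the container name and the archive path are illustrative assumptions, not part of podman or SingularityCE. Actually executing the commands is assumed to need a host with podman and CRIU installed, and typically root.

```python
import subprocess
from typing import List, Optional


def checkpoint_cmd(container: str, export_path: Optional[str] = None) -> List[str]:
    """Build the argv for `podman container checkpoint` (hypothetical helper)."""
    cmd = ["podman", "container", "checkpoint"]
    if export_path is not None:
        # --export writes the checkpoint to an archive that can be moved elsewhere.
        cmd.append(f"--export={export_path}")
    cmd.append(container)
    return cmd


def restore_cmd(container: Optional[str] = None,
                import_path: Optional[str] = None) -> List[str]:
    """Build the argv for `podman container restore` (hypothetical helper).

    Exactly one of `container` (restore in place) or `import_path`
    (restore from an exported archive) must be given.
    """
    if (container is None) == (import_path is None):
        raise ValueError("give exactly one of container or import_path")
    cmd = ["podman", "container", "restore"]
    if import_path is not None:
        cmd.append(f"--import={import_path}")
    else:
        cmd.append(container)
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only: "job1" and the archive path are made-up example values.
    run(checkpoint_cmd("job1", "/tmp/job1-checkpoint.tar.gz"))
    run(restore_cmd(import_path="/tmp/job1-checkpoint.tar.gz"))
```

A comparable user experience for `singularity run/shell/exec` is what this request asks for; the exact command names for SingularityCE are not defined here.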
Describe alternatives you've considered
The apptainer project has implemented checkpoint/restore of instances only, using DMTCP: https://github.com/apptainer/apptainer/pull/109
This is definitely useful for checkpointing instances. However:
- [...] run/exec batch jobs or interactive shell tasks, rather than instances. We anticipate most use of instances would be for persistent services which are likely to be able to maintain state themselves, across shutdown/startup.
- [...] (/proc/n/) might not be the process's real PID on resume; wrapped getpid() will return the original, but using /proc/ directly. This may be old information but needs to be investigated and understood. EDIT - this appears to be out of date: https://github.com/dmtcp/dmtcp/issues/461

On the plus side for DMTCP over CRIU - it supports some HPC-relevant concepts that CRIU does not, or did not when the comparison was last updated: https://criu.org/Comparison_to_other_CR_projects
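One way to investigate the PID concern raised in the alternatives above is to compare, from inside a restored process, what getpid() reports with what /proc reports. The following is a minimal sketch, assuming Linux with procfs mounted; the function names `pid_views` and `pids_agree` are hypothetical. Whether DMTCP's getpid() wrapper takes effect for a Python process, and whether it also virtualizes /proc paths, is not verified here.

```python
import os
from typing import Optional, Tuple


def pid_views() -> Optional[Tuple[int, int]]:
    """Return (PID from getpid(), PID from /proc/self), or None without procfs."""
    try:
        # /proc/self is a symlink whose target is the calling process's PID
        # as the mounted procfs sees it.
        proc_pid = int(os.readlink("/proc/self"))
    except (OSError, ValueError):
        return None
    return os.getpid(), proc_pid


def pids_agree() -> Optional[bool]:
    """True if both views match, False if they differ, None if unknown."""
    views = pid_views()
    if views is None:
        return None
    return views[0] == views[1]


if __name__ == "__main__":
    print("getpid() vs /proc/self:", pid_views(), "agree:", pids_agree())
```

Running this before a checkpoint and again after a restore would show whether code that builds /proc/<pid>/ paths from getpid() could end up pointing at the wrong process.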