sylabs / singularity

SingularityCE is the Community Edition of Singularity, an open source container platform designed to be simple, fast, and secure.
https://sylabs.io/docs/

Investigate CRIU based checkpoint / restore #526

Open dtrudg opened 2 years ago

dtrudg commented 2 years ago

Is your feature request related to a problem? Please describe.

It is not possible to checkpoint and restore running SingularityCE containers using built-in or otherwise simple methods.

Describe the solution you'd like

It should be possible to checkpoint and restore batch and interactive containers that are launched with singularity run/shell/exec.

The solution should be similar to what is possible with podman and docker, so it is familiar to users working in a mixed OCI and Singularity environment.

See: https://criu.org/Podman
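For reference, the Podman workflow being pointed to can be sketched roughly as below. This is only an illustration of the user experience being asked for, not a proposed SingularityCE interface: the helper names (`checkpoint_cmd`, `restore_cmd`, `run`) and the container and archive names are hypothetical, and the sketch assumes the `--export` / `--import` archive options of `podman container checkpoint` and `podman container restore`.

```python
import shlex
import subprocess


def checkpoint_cmd(container, archive=None):
    """Build the Podman command line that checkpoints a running container."""
    cmd = ["podman", "container", "checkpoint"]
    if archive:
        # --export writes the checkpoint to a tar archive so it can be moved
        # to another host (assumed option, see the lead-in).
        cmd += ["--export", archive]
    return cmd + [container]


def restore_cmd(container=None, archive=None):
    """Build the Podman command line that restores a container."""
    cmd = ["podman", "container", "restore"]
    if archive:
        # When restoring from an exported archive, no container name is needed.
        return cmd + ["--import", archive]
    if container is None:
        raise ValueError("either a container name or an archive is required")
    return cmd + [container]


def run(cmd, dry_run=True):
    """Print the command and, unless dry_run is set, execute it."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical container and archive names, printed only (dry run).
    run(checkpoint_cmd("mycontainer", archive="/tmp/checkpoint.tar.gz"))
    run(restore_cmd(archive="/tmp/checkpoint.tar.gz"))
```

The commands are printed rather than executed by default, since an actual checkpoint needs CRIU installed and usually root privileges.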

Describe alternatives you've considered

The Apptainer project has implemented checkpoint/restore for instances only, using DMTCP: https://github.com/apptainer/apptainer/pull/109

This is definitely useful for checkpointing instances. However:

On the plus side for DMTCP over CRIU - it supports some HPC relevant concepts that CRIU does not, or did not when the comparison was last updated:

https://criu.org/Comparison_to_other_CR_projects

olifre commented 2 years ago

On the plus side for DMTCP over CRIU - it supports some HPC relevant concepts that CRIU does not, or did not when the comparison was last updated [...]

I think this comparison was focussing on off-the-shelf integration with MPI and other libraries. Since CRIU checkpoints from "outside" the process, there is no direct integration (yet). However, real-life testing has already proven that CRIU can be used to checkpoint multi-node MPI applications: http://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Ahv%3Adiva-11645

It does require coordination by an outside tool (e.g. a batch system), though.
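To make "coordination by an outside tool" a bit more concrete, here is a minimal sketch of what such a coordinator might plan. It is not the method used in the linked work. It assumes a hypothetical list of `(host, pid)` pairs for the MPI ranks, passwordless `ssh` to each node, and the `criu dump` options `--tree`, `--images-dir` and `--leave-running`. It only builds the per-rank commands, and deliberately ignores the hard part: quiescing inter-node communication so that all ranks are dumped in a consistent state.

```python
import shlex


def criu_dump_cmd(pid, images_dir, leave_running=False):
    """Build a `criu dump` command line for one process tree."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the process alive after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd


def plan_coordinated_dump(ranks, base_dir, leave_running=False):
    """Return one ssh command per rank.

    ranks    -- list of (host, pid) pairs (hypothetical input, e.g. supplied
                by the batch system)
    base_dir -- directory on shared storage; each rank gets its own subdir
    """
    plan = []
    for rank, (host, pid) in enumerate(ranks):
        images_dir = f"{base_dir}/rank{rank}"
        remote = shlex.join(criu_dump_cmd(pid, images_dir, leave_running))
        plan.append(["ssh", host, remote])
    return plan


if __name__ == "__main__":
    # Hypothetical hosts and PIDs; the commands are printed, not executed.
    for cmd in plan_coordinated_dump(
        [("node01", 4242), ("node02", 4343)], "/scratch/ckpt"
    ):
        print(shlex.join(cmd))
```

A batch system would be a natural place for this kind of logic, since it already knows which nodes and processes belong to a job.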