Proposal: checkpoint image definition

adrianreber commented 2 years ago

Background

Over the last couple of years we have been working on CRIU-based checkpointing. We introduced a simple export file format in Podman to easily move a complete checkpoint from one system to another. A complete checkpoint includes the checkpoint files created by CRIU, the current container configuration (spec.dump), the content of the file-system diff against the original image, log files and metadata:

# tar tf /tmp/dump.tar 
artifacts/
devshm-checkpoint.tar
config.dump
spec.dump
network.status
stats-dump
checkpoint/
checkpoint/<criu-files>
rootfs-diff.tar

For the Kubernetes use case we used this layout for the checkpoint archives we are creating in CRI-O (merged) and containerd (still under review). Also, during the discussion around the KEP introducing checkpointing to Kubernetes (https://github.com/kubernetes/enhancements/issues/2008), containerd's OCI-based checkpoint image was mentioned.

Saving the checkpoint data in an image that can be pushed to a registry made a lot of sense to us and we looked at the containerd checkpoint image. Unfortunately it includes containerd internal protobuf dumps which did not seem useful to have in checkpoint images.

So we created another image format by simply copying the tarball which we already have to an OCI image with some metadata. For CRI-O we currently use the following:

$ tree .
.
├── blobs
│   └── sha256
│       ├── 2cd9b96662de0ea7defcd950ca12d04e080c57781cbd1bcf26ce522ba8313daa
│       ├── 4c200251779966c00a1f3f5d9e3dc61d8e34e9392ce517d7fc1bc222d4d716de
│       └── dffe689f20affdad7c5a8b3a380abb7caa4c2d4c09a54dfa3f6885cc99fcca6f
├── index.json
└── oci-layout

2 directories, 5 files
$ cat index.json
{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:2cd9b96662de0ea7defcd950ca12d04e080c57781cbd1bcf26ce522ba8313daa",
      "size": 563
    }
  ]
}
$ cat blobs/sha256/2cd9b96662de0ea7defcd950ca12d04e080c57781cbd1bcf26ce522ba8313daa
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:4c200251779966c00a1f3f5d9e3dc61d8e34e9392ce517d7fc1bc222d4d716de",
    "size": 326
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:dffe689f20affdad7c5a8b3a380abb7caa4c2d4c09a54dfa3f6885cc99fcca6f",
      "size": 2211259
    }
  ],
  "annotations": {
    "io.kubernetes.cri-o.annotations.checkpoint.name": "counter",
    "org.opencontainers.image.base.digest": "",
    "org.opencontainers.image.base.name": ""
  }
}
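
For readers who want to inspect such a layout programmatically, here is a minimal sketch that follows index.json to the manifest blob, verifies its digest, and returns the parsed manifest. It assumes a single-manifest layout like the one shown; `load_manifest` and `write_demo_layout` are hypothetical helper names, and the demo manifest is a cut-down stand-in rather than the real blob.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def blob_path(layout, digest):
    """Map a digest such as 'sha256:abc...' to its file inside the layout."""
    algo, hexdigest = digest.split(":", 1)
    return Path(layout) / "blobs" / algo / hexdigest


def load_manifest(layout):
    """Follow index.json to the first manifest and verify its digest."""
    index = json.loads((Path(layout) / "index.json").read_text())
    desc = index["manifests"][0]
    raw = blob_path(layout, desc["digest"]).read_bytes()
    algo, expected = desc["digest"].split(":", 1)
    if hashlib.new(algo, raw).hexdigest() != expected:
        raise ValueError("manifest digest mismatch")
    return json.loads(raw)


def write_demo_layout(layout):
    """Write a tiny layout with one manifest so the sketch is runnable."""
    manifest = {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "annotations": {
            "io.kubernetes.cri-o.annotations.checkpoint.name": "counter"
        },
    }
    raw = json.dumps(manifest).encode()
    digest = "sha256:" + hashlib.sha256(raw).hexdigest()
    path = blob_path(layout, digest)
    path.parent.mkdir(parents=True)
    path.write_bytes(raw)
    index = {
        "schemaVersion": 2,
        "manifests": [
            {"mediaType": manifest["mediaType"], "digest": digest, "size": len(raw)}
        ],
    }
    (Path(layout) / "index.json").write_text(json.dumps(index))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as layout:
        write_demo_layout(layout)
        print(load_manifest(layout)["annotations"])
```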

We are currently using buildah to create this kind of image:

$ newcontainer=$(buildah from scratch)
$ buildah add $newcontainer "/var/lib/kubelet/checkpoints/checkpoint-counters_default-counter-2022-10-13T12:42:38Z.tar" /
$ buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=counter $newcontainer
$ buildah commit $newcontainer checkpoint-image:tag34
$ podman push localhost/checkpoint-image:tag34 quay.io/adrianreber/checkpoint-test:tag34

In Kubernetes we can now point to the image quay.io/adrianreber/checkpoint-test:tag34 and CRI-O, based on the annotations, will detect that a restore is required and not just a create/start.
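
The decision described here can be sketched as a small function over the parsed manifest. This is only an illustration of the idea, not CRI-O's implementation; `start_action` is a hypothetical name, and treating an empty annotation value as "not a checkpoint" is an assumption.

```python
CHECKPOINT_ANNOTATION = "io.kubernetes.cri-o.annotations.checkpoint.name"


def start_action(manifest):
    """Return 'restore' for a checkpoint image, 'create' for a regular image."""
    annotations = manifest.get("annotations") or {}
    # Assumption: a non-empty checkpoint name marks the image as a checkpoint.
    return "restore" if annotations.get(CHECKPOINT_ANNOTATION) else "create"


if __name__ == "__main__":
    checkpoint = {"annotations": {CHECKPOINT_ANNOTATION: "counter"}}
    regular = {"annotations": {}}
    print(start_action(checkpoint), start_action(regular))
```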

At this point we can restore a checkpoint image created by Podman in CRI-O, and we can probably make sure to understand the image created by containerd. But at this point we already have three slightly different, still compatible implementations (Podman, containerd (not merged) and CRI-O) of the same thing and one (the original from containerd) which is not compatible due to the use of containerd-internal protobuf structures.

To avoid another image format containing checkpoint information we would like to propose a definition of what a checkpoint image should look like and we hope this is the right location for it.

Proposal

Over the last couple of years we have slowly added additional information to the checkpoint archive used by Podman, but for this proposal we want to start with the minimal set of information which we think would be important to have in an image. This is based on the current image we are using for Kubernetes in combination with CRI-O.

We would like to add the following annotations to a checkpoint image so that it can be easily identified as such an image:

org.opencontainers.image.checkpoint.created: date and time on which the checkpoint was created, conforming to RFC 3339
org.opencontainers.image.checkpoint.name: name of the checkpointed container
org.opencontainers.image.checkpoint.checkpointer: name of the underlying tool that created the checkpoint data (currently only criu)
org.opencontainers.image.checkpoint.runtime: name of the runtime that created the checkpoint (we are currently aware of runc and crun that are able to create checkpoints. There is some support in youki. We are working to make sure that runc can restore crun checkpoints and vice versa, but having this information in the annotations allows the consumer of this checkpoint image to decide early if it can restore this image without looking at the actual checkpoint content.)
org.opencontainers.image.checkpoint.engine: something like containerd, Podman, CRI-O
org.opencontainers.image.checkpoint.base.digest: see org.opencontainers.image.base.digest (this helps the consumer to decide early if it can restore the container on top of the mentioned image. Not sure if this is needed. Maybe just use org.opencontainers.image.base.digest)
org.opencontainers.image.checkpoint.base.name: see org.opencontainers.image.base.name and org.opencontainers.image.checkpoint.base.digest
org.opencontainers.image.checkpoint.id: the container ID of the original checkpointed container
org.opencontainers.image.checkpoint.version: a checkpoint image version identifier (not sure, but maybe useful to easily track if this proposal changes or is amended.)
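
To make the list concrete, here is a minimal sketch that assembles these annotations and reports which of the proposed keys are still missing. The key names come from the list above; the helper names, and the choice to format the RFC 3339 timestamp in UTC with a trailing Z, are assumptions.

```python
from datetime import datetime, timezone

PREFIX = "org.opencontainers.image.checkpoint."
FIELDS = (
    "created", "name", "checkpointer", "runtime", "engine",
    "base.digest", "base.name", "id", "version",
)


def checkpoint_annotations(values, created=None):
    """Build the annotation map; 'created' is always generated, never passed in."""
    unknown = set(values) - (set(FIELDS) - {"created"})
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    created = created or datetime.now(timezone.utc)
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    annotations = {PREFIX + "created": stamp}
    annotations.update({PREFIX + key: str(val) for key, val in values.items()})
    return annotations


def missing_annotations(annotations):
    """List proposed keys that are absent from an annotation map."""
    return [PREFIX + f for f in FIELDS if PREFIX + f not in annotations]


if __name__ == "__main__":
    ann = checkpoint_annotations(
        {"name": "counter", "checkpointer": "criu", "runtime": "runc"}
    )
    print(ann)
    print(missing_annotations(ann))
```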

I am unsure if a custom media type is needed to describe the layer containing the actual checkpoint data. Looking at what we currently use

  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:dffe689f20affdad7c5a8b3a380abb7caa4c2d4c09a54dfa3f6885cc99fcca6f",
      "size": 2211259
    }
  ],

I am undecided. If we want to have additional information in the checkpoint we could just put it into this one layer. So far everything we worked with is put in tar archives or just plain files. Having multiple layers with the different content (rootfs diff, /dev/shm content) is a possibility but we can also just put it in one single layer.

Our proposal for what would be in this layer is the following:

checkpoint.<checkpointer>: a directory with the actual checkpoint data created by the checkpointer mentioned in org.opencontainers.image.checkpoint.checkpointer. This would be something like checkpoint.criu.
config.json: the container configuration as it was during checkpointing
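
As an illustration of this layout, the following sketch packs a checkpoint directory and a config.json into one gzip-compressed tar layer and computes the descriptor fields (digest and size) that would go into the manifest. It is a sketch under assumptions: `build_layer` is a hypothetical helper, and the file name and config content in the demo are placeholders rather than real checkpoint data.

```python
import hashlib
import io
import json
import tarfile

LAYER_MEDIA_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"


def build_layer(checkpointer, checkpoint_files, config):
    """Return (blob, descriptor) for a single-layer checkpoint image."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:

        def add(name, data):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

        # Checkpoint data lives under checkpoint.<checkpointer>/, e.g. checkpoint.criu/
        for name, data in checkpoint_files.items():
            add(f"checkpoint.{checkpointer}/{name}", data)
        # The container configuration as it was during checkpointing.
        add("config.json", json.dumps(config).encode())
    blob = buf.getvalue()
    descriptor = {
        "mediaType": LAYER_MEDIA_TYPE,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }
    return blob, descriptor


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    blob, desc = build_layer("criu", {"inventory.img": b"\x00"}, {"ociVersion": "1.0.2"})
    print(desc)
```

Because gzip embeds a timestamp, the digest differs between runs; a real tool that needs reproducible digests would have to pin that.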

From our work over the last few years creating such images we think this would be enough as a starting point.

I hope this is the right place to bring up such a request. We are open to (almost) any changes. The important part, from our point of view, is that we find a way to have a well defined layout for a checkpoint image because at this point it starts to get complicated to ensure that all involved tools can work together. From our point of view there is no technical reason that a checkpoint created with one tool (either runc, crun or containerd, CRI-O, Podman) should not work with the others. If we have seen problems so far it was always about missing or different metadata.

sudo-bmitch commented 2 years ago

Hi @adrianreber. From the description, I'm thinking an artifact containing the checkpoint, and a subject pointing back to the base image, might be a good model for this. That work only recently merged into main, and we're looking for users to prove out the design. (Note that artifact here includes both the new media type, and using the existing image manifest with a custom config media type, and you'll likely start with the latter for portability.)

estesp commented 2 years ago

@sudo-bmitch while I think it is an interesting thought to consider the references work, I think that ties you to storing checkpoint images in the same registry where the original base image is located? If I understand the referrers API language correctly, the checkpoint image with the subject digest of the base image would have to be in the same repo namespace as the base image for the referrers API to even work properly. Given checkpointing is a downstream action by image consumers, who may or may not have a relationship to the image producer, this seems to be a potential mismatch with the reason for having subject/referrers. Happy to be corrected if I misunderstood the potential there.

sudo-bmitch commented 2 years ago

@estesp good catch. You're right, if you are spanning repositories or registries, the subject/referrers is a bad match.

vbatts commented 2 years ago

IIRC, folks have surfaced use-cases like this and it was decided that artifacts could span registries.

It's a bit crazy to think these would have to be on the same registry. Where the snapshot data may likely even just be on a local ephemeral registry for a quick relocation migration.

mikebrow commented 2 years ago

nod subject refers may be limited to local sha wrt. the referrers api (as is).. but we can use a different mechanism for base image manifest ref in a new artifact type (or new manifest type) .. you just would not get a list of checkpoints on a base image through the new referrers api (when the checkpoints are remote..).. or we can change up subject (refers) to also support a proxy/redirect pattern with the usual domain/image

rst0git commented 2 years ago

I'm thinking an artifact containing the checkpoint, and a subject pointing back to the base image

One thing that we need to consider with respect to the base image is that if the base image has changed (e.g., has been updated or removed) the associated checkpoints might not work anymore. For instance, CRIU restore might fail due to missing or changed files. Thus, it would be useful to have a mechanism that allows us to find all checkpoint images in a registry that are associated with a specific base image.

sudo-bmitch commented 2 years ago

One thing that we need to consider with respect to the base image is that if the base image has changed (e.g., has been updated or removed) the associated checkpoints might not work anymore. For instance, CRIU restore might fail due to missing or changed files. Thus, it would be useful to have a mechanism that allows us to find all checkpoint images in a registry that are associated with a specific base image.

That's where subject/referrers is useful, since you reference a specific digest, the base cannot change and the registry knows the relationship. Though in this case, that relationship is flipped, delete the base image, and untagged checkpoints would be eligible for GC. But this only works if checkpoints are limited to the same repository, which I don't think we want.

I don't think any solution would allow the registry to know there is a checkpoint in a different repository, and especially not in a different registry.

adrianreber commented 2 years ago

Thanks everyone for the feedback. Seems like this is the right place for this discussion.

I am not sure I can completely follow the discussion, but artifacts sound like a good place for the checkpoint data. Trying to understand how this could work, I looked at https://github.com/opencontainers/image-spec/blob/main/artifact.md and, just to make sure I understand it correctly: could a checkpoint image using artifacts look like this?

{
  "mediaType": "application/vnd.oci.artifact.manifest.v1+json",
  "blobs": [
    {
      "mediaType": "application/gzip",
      "size": 12345,
      "digest": "sha256:12343725d74f4bfb94c9e86d64170f7521aad8221a5de834851470ca142da630"
    },
    {
      "mediaType": "application/json",
      "size": 123,
      "digest": "sha256:56783725d74f4bfb94c9e86d64170f7521aad8221a5de834851470ca142da630"
    }
  ],
  "annotations": {
    "org.opencontainers.image.checkpoint.created": "2022-10-19T14:42:55Z",
    "org.opencontainers.image.checkpoint.name":"<container name>",
    "org.opencontainers.image.checkpoint.checkpointer":"criu",
    "org.opencontainers.image.checkpoint.runtime":"runc",
    "org.opencontainers.image.checkpoint.engine":"cri-o",
    "org.opencontainers.image.checkpoint.base.digest":"sha256:afff3924849e458c5ef....d51",
    "org.opencontainers.image.checkpoint.base.name":"docker.io/library/alpine",
    "org.opencontainers.image.checkpoint.id":"3fc2f9bf82e9",
    "org.opencontainers.image.checkpoint.version":"1"
  }
}

One of the two blobs would be a compressed (application/gzip) tar archive containing the data created by CRIU, and the JSON part would be the config.json during checkpointing. If, at some point, we need a second JSON file for additional metadata (I already have an idea for at least two possible extensions), how can the restorer know which blob contains which data? Is there a way in the artifact definition to give additional information about the blobs? I do not think subject can help here. It seems like it would need additional annotations pointing to the sha256 of the different blobs, right?
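
One possible answer, sketched here purely as an illustration: OCI content descriptors may carry their own optional annotations map, so each blob could be labeled with its role directly instead of the manifest pointing at digests. The annotation key `org.example.checkpoint.role` and the role names are hypothetical and not part of any specification, and the digests in the demo are placeholders.

```python
# Hypothetical per-descriptor annotation key; not defined by any spec.
ROLE_KEY = "org.example.checkpoint.role"


def blobs_by_role(manifest):
    """Map each declared role to the digest of the blob that carries it."""
    roles = {}
    for desc in manifest.get("blobs", []):
        role = (desc.get("annotations") or {}).get(ROLE_KEY)
        if role is None:
            continue  # unlabeled blobs are ignored in this sketch
        if role in roles:
            raise ValueError(f"duplicate role: {role}")
        roles[role] = desc["digest"]
    return roles


if __name__ == "__main__":
    manifest = {
        "blobs": [
            {
                "mediaType": "application/gzip",
                "digest": "sha256:" + "1" * 64,  # placeholder digest
                "annotations": {ROLE_KEY: "checkpoint-data"},
            },
            {
                "mediaType": "application/json",
                "digest": "sha256:" + "2" * 64,  # placeholder digest
                "annotations": {ROLE_KEY: "container-config"},
            },
        ]
    }
    print(blobs_by_role(manifest))
```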

There was one sentence in the artifact definition which was a bit confusing:

Unlike OCI Images, OCI Artifacts are not meant to be used by any container runtime.

For the checkpoint image we would use it by the container runtime. Just wanted to make sure it is the right place.

Concerning referrers: I also think that the base image could be in any other registry and should be retrieved by the container engine using org.opencontainers.image.checkpoint.base.name and org.opencontainers.image.checkpoint.base.digest.
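
A container engine could turn those two annotations into a pullable, digest-pinned reference. The sketch below assumes the common `name@digest` reference form and only accepts sha256 digests; `base_image_reference` is a hypothetical helper and the digest in the demo is a placeholder.

```python
import re

PREFIX = "org.opencontainers.image.checkpoint."


def base_image_reference(annotations):
    """Return 'name@digest' for the base image, or None if it cannot be built."""
    name = annotations.get(PREFIX + "base.name")
    digest = annotations.get(PREFIX + "base.digest")
    if not name or not digest:
        return None
    # Assumption: only full sha256 digests are accepted in this sketch.
    if not re.fullmatch(r"sha256:[0-9a-f]{64}", digest):
        return None
    return f"{name}@{digest}"


if __name__ == "__main__":
    annotations = {
        PREFIX + "base.name": "docker.io/library/alpine",
        PREFIX + "base.digest": "sha256:" + "a" * 64,  # placeholder digest
    }
    print(base_image_reference(annotations))
```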

sudo-bmitch commented 2 years ago

For portability, I'd hold back on using the artifact manifest. We only just merged that and I don't think there are any public registries that have added support.

When pushing using the image manifest, you'll want to change the config media type from application/vnd.oci.image.config.v1+json to something specific to the checkpoint. If there is no group to put that under, then we can consider an appropriate OCI media type for this. Most registries allow this media type to vary, changing it is a signal to runtimes that it's not a typical OCI image with layers, and it aligns with the artifactType in the artifact manifest. That method of shipping an artifact is useful for backwards compatibility and is defined in https://github.com/opencontainers/artifacts.
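
The signal described here can be sketched as a check on the config media type. The checkpoint media type below is a made-up placeholder, since no such type has been defined in this discussion; `classify` is a hypothetical helper name.

```python
OCI_IMAGE_CONFIG = "application/vnd.oci.image.config.v1+json"
# Hypothetical media type used only for this sketch; no such type is registered.
CHECKPOINT_CONFIG = "application/vnd.example.checkpoint.config.v1+json"


def classify(manifest):
    """Classify an image manifest by its config media type."""
    media_type = (manifest.get("config") or {}).get("mediaType")
    if media_type == OCI_IMAGE_CONFIG:
        return "image"
    if media_type == CHECKPOINT_CONFIG:
        return "checkpoint"
    return "other"


if __name__ == "__main__":
    print(classify({"config": {"mediaType": OCI_IMAGE_CONFIG}}))
    print(classify({"config": {"mediaType": CHECKPOINT_CONFIG}}))
```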

Unlike OCI Images, OCI Artifacts are not meant to be used by any container runtime.

For the checkpoint image we would use it by the container runtime. Just wanted to make sure it is the right place.

I think the key differentiator is they modify how a runtime executes an image, rather than being an image themselves.

sftim commented 2 years ago

io.kubernetes.cri-o.annotations.checkpoint.name

Let's make sure that Kubernetes SIG Architecture is happy to endorse this use of a Kubernetes domain name.

sftim commented 2 years ago

Things I can imagine needing to track somehow:

the CPU architecture details for the system where the checkpoint took place (lots of code assumes that CPU feature bits don't change after probing)
the tag of the container image that was running when the snapshot was taken
the uptime of that container
the number of total CPU cores that the app believed were available, prior to the snapshot
the number of usable CPU cores that the app believed were available, prior to the snapshot
the start and end times for the checkpoint operation, with subsecond precision
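
Some of these values are cheap to collect on the checkpointing host. The sketch below gathers the architecture name, total and usable CPU counts, and a subsecond timestamp using only the Python standard library. The dictionary keys are hypothetical, it records the collecting process's view rather than what the checkpointed app believed, and it does not capture CPU feature bits.

```python
import os
import platform
from datetime import datetime, timezone


def host_metadata():
    """Collect a few host details; key names are illustrative only."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux: CPUs this process is actually allowed to run on.
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return {
        "architecture": platform.machine(),
        "cpus.total": total,
        "cpus.usable": usable,
    }


def timestamp():
    """RFC 3339 style UTC timestamp with microsecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="microseconds")


if __name__ == "__main__":
    start = timestamp()
    meta = host_metadata()
    end = timestamp()
    print(meta, start, end)
```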

adrianreber commented 1 year ago

How can this move forward? It is not really clear to me if the discussions about artifacts resulted in anything specific, because I probably do not really understand the artifact discussions.

From my point of view it would be good to have a definition of how to store a checkpoint image in a registry. A checkpoint image needs to include:

a couple of tar archives with the data from the checkpoint/rootfs diff/shm content/volumes
a couple of json files with the container and system configuration
additional metadata to easily identify the checkpoint without downloading the whole image

sftim commented 1 year ago

And some more questions

Do we want to define what mount points the snapshotted app would expect to find?
Do we want a defined way to express “this is a checkpoint of a running container, but as the snapshotter / publisher I assert that there's nothing confidential in its memory”

adrianreber commented 1 year ago

And some more questions

Do we want to define what mount points the snapshotted app would expect to find?

Yes, this is needed and would be one of the JSON files I mentioned. In CRI-O we are currently tracking this.

Do we want a defined way to express “this is a checkpoint of a running container, but as the snapshotter / publisher I assert that there's nothing confidential in its memory”

Also a good idea.

hesch commented 1 year ago

Some questions regarding incremental checkpoints

Do we need any special metadata here to support them?
Do we want to separate the actual checkpoint data and the added metadata from CRI-O to "reuse" the old metadata?

IMO it is too early to specify anything for incremental checkpoints at this point. AFAIK we don't have any implementation that uses them yet. But we should keep this use-case in mind for a later extension of the specification.

rst0git commented 1 year ago

@hesch CRIU supports incremental checkpoints via the pre-dump command. This command creates a snapshot of the memory changes, and uses a symlink ("parent") to create a link to previous checkpoints. The dump command can then be used to create a checkpoint that includes a complete snapshot and links to the previous checkpoints. At restore time, CRIU is pointed to the path of the complete snapshot and uses the symlinks to find the content of memory pages that was captured in previous pre-dump iterations.
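
The chain of "parent" symlinks can be walked with a few lines of code. This sketch builds a throwaway directory tree that mimics two pre-dump iterations followed by a final dump and then resolves the chain from the final dump backwards. The directory names (pre1, pre2, dump) are assumptions for the demo, and the code only follows symlinks; it does not read CRIU image files.

```python
import os
import tempfile


def checkpoint_chain(final_dir):
    """Return the final dump directory followed by its ancestors, newest first."""
    chain = [os.path.realpath(final_dir)]
    seen = set(chain)
    while True:
        link = os.path.join(chain[-1], "parent")
        if not os.path.islink(link):
            break  # no further parent: this is the oldest pre-dump
        parent = os.path.realpath(link)
        if parent in seen:
            raise ValueError("parent links form a loop")
        chain.append(parent)
        seen.add(parent)
    return chain


def make_demo_chain(root):
    """Create pre1 <- pre2 <- dump linked via relative 'parent' symlinks."""
    for name in ("pre1", "pre2", "dump"):
        os.makedirs(os.path.join(root, name))
    os.symlink("../pre1", os.path.join(root, "pre2", "parent"))
    os.symlink("../pre2", os.path.join(root, "dump", "parent"))
    return os.path.join(root, "dump")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        final = make_demo_chain(root)
        print([os.path.basename(p) for p in checkpoint_chain(final)])
```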

With checkpoint images, we currently include a complete snapshot that is stored in a single layer. To extend the current approach to support incremental checkpoints we could create an image with multiple layers, where each layer includes a snapshot of the memory changes, and the final layer includes the complete checkpoint.

Do we need any special metadata here to support them? Do we want to separate the actual checkpoint data and the added metadata from CRI-O to "reuse" the old metadata?

IMHO, the approach described above could enable incremental checkpoints without special metadata.

@adrianreber What do you think?

sftim commented 1 year ago

Extra metadata could be helpful: it lets an implementation spot that a snapshot isn't compatible (because the implementation only supports single layer checkpoints) with less effort.

That's especially relevant if early implementations are likely to miss out that support.

hesch commented 1 year ago

@rst0git I also think that approach would be good. If I understand this wiki page correctly, there is also the possibility to have incremental checkpoints with the full dump command and restore from different points in time. With the separation into layers we could then have many checkpoint images referencing the same memory page layers in the same registry.

adrianreber commented 1 year ago

Any recommendations how we can move this forward? It is not clear to me what the current situation is. It seems nobody is against it. We basically need something to put a couple of binary blobs into the image, a couple of JSON files and some additional metadata in the annotations. Any way we can get this defined?

tianouya-db commented 1 year ago

@adrianreber, have you considered packing the checkpoint data in an OCI artifact? I was experimenting with it like this (taken with containerd + criu), and it was accepted by Harbor:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "artifactType": "application/vnd.containerd.container.checkpoint.config.v1+proto",
  "config": {
    "mediaType": "application/vnd.oci.scratch.v1+json",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:a5f4c4a8e63cf04411333c4f6508aba2c4e1eda3e6fbdc55a4fae3351995ad14",
      "size": 2632808
    },
    {
      "mediaType": "application/vnd.containerd.container.criu.checkpoint.criu.tar",
      "digest": "sha256:7186a394d2be9d2a75ba30b4042fe4337bf3104072f216e689cba14649628a50",
      "size": 3365628
    },
    {
      "mediaType": "application/vnd.containerd.container.checkpoint.config.v1+proto",
      "digest": "sha256:91a2a959b6d4e4efc5fbc7e7843539320857962940fddc00ad24873d829e7654",
      "size": 14708
    },
    .........
  ]
}

adrianreber commented 1 year ago

@adrianreber, have you considered packing the checkpoint data in an OCI artifact? I was experimenting with it like this (taken with containerd + criu):

Thanks for your interest in this topic. It is not really a question of how to do it. We have a working solution and there are many ideas on how to do it. In the end I am open to anything. I would just like to have a standard.

We originally looked at the containerd format, but that uses binary protobuf blobs which is a dependency we want to avoid. JSON would work, but not protobuf. The overhead to decode protobuf seems to complicate the format unnecessarily.

For a standard we would prefer JSON.

We have working container migration in combination with Kubernetes, but currently we use our CRI-O only format.

We would just like to have it standardized for better interoperability between runtimes and especially engines. We already migrated containers from Podman to CRI-O and I am pretty positive that it should be doable between many container engines, but a standard would be nice.

opencontainers / image-spec

Proposal: checkpoint image definition #962