adrianreber commented 3 years ago

Enhancement Description

One-line enhancement description (can be used as a release note): Forensic Container Checkpointing
Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2008-forensic-container-checkpointing
Discussion Link:
- SIG Node weekly meeting: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit
- SIG Node planning doc for v1.23: https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit
Primary contact (assignee): @adrianreber
Responsible SIGs: SIG Node
Enhancement target (which target equals which milestone): No target so far
- Alpha release target (x.y): 1.25
- Beta release target (x.y): 1.30
- Stable release target (x.y): 1.32
Documentation
- https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
- https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
- https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/
- [x] Alpha (1.25)
  - [x] KEP (k/enhancements) update PR(s):
  - [x] Code (k/k) update PR(s):
    - [x] https://github.com/kubernetes/kubernetes/pull/104907
    - [x] https://github.com/kubernetes/kubernetes/pull/115155
  - [x] Docs (k/website) update(s):
- [ ] Beta (1.30)
  - [x] KEP (k/enhancements) update PR(s):
    - [x] https://github.com/kubernetes/enhancements/pull/4288
  - [x] Code (k/k) update PR(s):
    - [x] https://github.com/kubernetes/kubernetes/pull/123215
  - [ ] Docs (k/website) update(s):
- [ ] https://github.com/kubernetes/enhancements/pull/4305
- [ ] https://github.com/kubernetes/kubernetes/pull/120898
- Abandoned PR: https://github.com/kubernetes/kubernetes/pull/115888

Jiaxuan-C commented 1 year ago

Thank you very much for your solution.

It is not really a solution. I should fix CRI-O to correctly work with checkpoint and restore on cgroup v2 systems.

Looking forward to your fix

I also encountered the same issue when I was modifying the kubelet and containerd code to customize checkpoint/restore in Kubernetes. Is it possible that the issue is with CRIU rather than the container runtime (CRI-O or containerd)? (I'm not a professional, this is just a guess.) Anyway, thank you very much.

gvhfx commented 9 months ago

@adrianreber Hello, I have carefully read your blog and attempted to perform container checkpoint/restore based on your steps. However, I encountered the following issue: when I tried to recover the Pod, it remained in the CreateContainerError state and the kubelet displayed an error: "no command specified." I have browsed through related issues and tried some solutions, but none of them have been successful. Could you please advise on how to resolve this? I sincerely appreciate your assistance!

adrianreber commented 9 months ago

@gvhfx are you using cgroupv2?

gvhfx commented 9 months ago

@adrianreber I have checked the version of cgroup and it is using v1. Additionally, my environment is CentOS7, and the previous kernel version was 3.10 (which does not support cgroupv2). I initially thought it might be due to the kernel, so I upgraded it to 5.4. However, I am still encountering the same error.

adrianreber commented 9 months ago

@gvhfx CentOS 7 is really old. Can you try something newer?

Tobeabellwether commented 9 months ago

Hi @adrianreber, over the past few months I've been trying to do some higher-level migrations using the checkpointing technique you developed:

I first tried migrating Pods, which, as you mentioned in your presentation, basically means checkpointing all the containers and then matching some metadata.

Then I tried to migrate a Pod that belongs to a ReplicaSet. I first deleted the ReplicaSet but kept the Pods, and then migrated the target Pod. After that, I recreated the ReplicaSet with the same configuration. The labels of the migrated Pod should match the selector of the ReplicaSet, so that the recreated ReplicaSet will capture it.

I've also done the migration on a Deployment and a StatefulSet using the same logic, and so far everything is working fine when just testing with your counter example.

However, I have no experience as a Kubernetes developer, so I just did this with Python scripts. Therefore, I would like to ask whether my approach really makes sense. If so, how difficult would it be to implement this in Kubernetes?
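The label matching described above can be sketched in a few lines of Python. This is only an illustration of `matchLabels`-style equality matching. It is not how the ReplicaSet controller is implemented, it ignores `matchExpressions` and owner references, and the label values are hypothetical.

```python
from typing import Mapping


def selector_matches(match_labels: Mapping[str, str], pod_labels: Mapping[str, str]) -> bool:
    """Return True if every key/value pair of the selector is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical labels for the recreated ReplicaSet and the migrated Pod.
replicaset_selector = {"app": "counter"}
migrated_pod_labels = {"app": "counter", "pod-template-hash": "7c69b4d88b"}

print(selector_matches(replicaset_selector, migrated_pod_labels))  # True
```

Running a check like this before recreating the ReplicaSet is a cheap way to confirm that the restored Pod kept the labels it needs.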

adrianreber commented 9 months ago

> However, I have no experience as a Kubernetes developer, so I just did this with Python scripts. Therefore, I would like to ask whether my approach really makes sense. If so, how difficult would it be to implement this in Kubernetes?

Sounds great. I do not think it would be too difficult. I also had an implementation of pod checkpoint/restore three years ago. I did the pod checkpoint creation in CRI-O, not sure how you have done it.

Tobeabellwether commented 9 months ago

> > However, I have no experience as a Kubernetes developer, so I just did this with Python scripts. Therefore, I would like to ask whether my approach really makes sense. If so, how difficult would it be to implement this in Kubernetes?
>
> Sounds great. I do not think it would be too difficult. I also had an implementation of pod checkpoint/restore three years ago. I did the pod checkpoint creation in CRI-O, not sure how you have done it.

I didn't do anything low-level. I simply ran your forensic checkpointing once for every container of a pod (which only supports CRI-O for now, right?) and then updated the container and node parts of the Pod configuration, leaving the rest of the configuration basically unchanged.
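A rough Python sketch of "checkpoint every container of a pod" follows. It only builds the request URLs, using the `POST /checkpoint/{namespace}/{pod}/{container}` form from the kubelet checkpoint API reference linked in the issue description. The node name, pod name, container names, and port 10250 are assumptions for illustration. Sending the requests additionally needs kubelet client credentials and the `ContainerCheckpoint` feature gate, which are not shown here.

```python
from urllib.parse import quote

KUBELET_PORT = 10250  # assumed kubelet port; adjust for your cluster


def checkpoint_url(node, namespace, pod, container, port=KUBELET_PORT):
    """Build the kubelet checkpoint URL for a single container."""
    path = "/".join(quote(part, safe="") for part in (namespace, pod, container))
    return f"https://{node}:{port}/checkpoint/{path}"


def pod_checkpoint_urls(node, namespace, pod, containers, port=KUBELET_PORT):
    """Build one checkpoint URL per container of the pod."""
    return [checkpoint_url(node, namespace, pod, c, port) for c in containers]


# Hypothetical node, pod, and container names.
urls = pod_checkpoint_urls("node-1.example", "default", "counters", ["counter", "sidecar"])
for url in urls:
    print("POST", url)
```

Each URL corresponds to one checkpoint archive, so a pod-level "migration" built this way is a set of independent container checkpoints plus a rewritten Pod spec.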

adrianreber commented 9 months ago

Pull request to automatically delete older checkpoint archives: https://github.com/kubernetes/kubernetes/pull/115888

adrianreber commented 9 months ago

First attempt to provide checkpoint via kubectl: https://github.com/kubernetes/kubernetes/pull/120898

Tobeabellwether commented 8 months ago

Hi @adrianreber, I created a simple microservice pod and tried to migrate it. Right after startup, when it only ran a counter-like function, I was able to checkpoint it. But once it had connected to the message broker and sent a message, checkpointing it raised the following error. Is there any way to solve it?

checkpointed: checkpointing of default/order-service-7c69b4d88b-n56xq/order-service failed (rpc error: code = Unknown desc = failed to checkpoint container b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af:

running "/usr/local/bin/runc" ["checkpoint" "--image-path" "/var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/checkpoint" "--work-path" "/var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata" "--leave-running" "b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af"]

failed: /usr/local/bin/runc --root /run/runc --systemd-cgroup checkpoint --image-path /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata --leave-running b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af

failed: time="2023-10-24T16:40:11Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: /var/lib/containers/storage/overlay-containers/b0ae35f30db3124341035041d02ce85bd83be2b38180081ab919a9f89e16c3af/userdata/dump.log"

adrianreber commented 8 months ago

@Tobeabellwether Please open a bug at CRI-O with the dump.log attached.

Tobeabellwether commented 8 months ago

> @Tobeabellwether Please open a bug at CRI-O with the dump.log attached.

@adrianreber Thanks for the tip, I checked that dump.log file and found:

(00.134562) Error (criu/sk-inet.c:191): inet: Connected TCP socket, consider using --tcp-established option.
(00.134634) ----------------------------------------
(00.134654) Error (criu/cr-dump.c:1669): Dump files (pid: 1533879) failed with -1

So, I tried to forcefully interrupt the TCP connection between the pod and the message broker, and the checkpoint was successfully created. Should this still be considered a bug?
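For anyone hitting a similar failure, a small Python helper can pull the `Error` lines out of a CRIU dump.log. This is a sketch that assumes the line layout seen in the excerpt above, `(timestamp) Error (source:line): message`. Other CRIU versions may format log lines differently.

```python
import re

# Matches lines such as: (00.134562) Error (criu/sk-inet.c:191): message
CRIU_ERROR = re.compile(
    r"^\((?P<ts>\d+\.\d+)\) Error \((?P<src>[^:()]+):(?P<line>\d+)\): (?P<msg>.*)$"
)


def criu_errors(log_text):
    """Return (source file, line number, message) for every Error line."""
    errors = []
    for raw in log_text.splitlines():
        match = CRIU_ERROR.match(raw.strip())
        if match:
            errors.append((match["src"], int(match["line"]), match["msg"]))
    return errors


sample = """\
(00.134562) Error (criu/sk-inet.c:191): inet: Connected TCP socket, consider using --tcp-established option.
(00.134634) ----------------------------------------
(00.134654) Error (criu/cr-dump.c:1669): Dump files (pid: 1533879) failed with -1
"""

for src, line, msg in criu_errors(sample):
    print(f"{src}:{line}: {msg}")
```

The first reported error is usually the useful one. Here it names the connected TCP socket, while the second only reports that the dump as a whole failed.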

adrianreber commented 8 months ago

Ah, good to know. No, this is not a real bug. We probably at some point need the ability to pass down different parameters from Kubernetes to CRIU. But this is something for the far future.

You can also control CRIU options with a CRIU configuration file. Handling TCP connections that are established could be configured there.

Tobeabellwether commented 8 months ago

> Ah, good to know. No, this is not a real bug. We probably at some point need the ability to pass down different parameters from Kubernetes to CRIU. But this is something for the far future.
>
> You can also control CRIU options with a CRIU configuration file. Handling TCP connections that are established could be configured there.

Hi again @adrianreber, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:

So I tried to check the log file at /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore:

Restoring pods without TCP connections works fine.

adrianreber commented 8 months ago

@Tobeabellwether You can redirect the CRIU log file to another file using the CRIU configuration file: log-file /tmp/restore.log and have a look at that file.
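As a sketch of what such a configuration file could contain, the Python snippet below renders the two options mentioned in this thread. It assumes the CRIU configuration format of one option per line, written as the long option name without the leading dashes. Which file is actually read when CRIU is started by runc under CRI-O (for example /etc/criu/runc.conf) is an assumption that should be checked against the runc and CRIU documentation for your versions.

```python
def render_criu_config(options):
    """Render options as CRIU config lines: 'name' for flags, 'name value' otherwise."""
    lines = []
    for name, value in options.items():
        if value is True:
            lines.append(name)  # boolean flag, e.g. tcp-established
        elif value is not False and value is not None:
            lines.append(f"{name} {value}")  # option with an argument
    return "\n".join(lines) + "\n"


config = render_criu_config({
    "tcp-established": True,         # handle established TCP connections
    "log-file": "/tmp/restore.log",  # redirect the CRIU log, as suggested above
})
print(config, end="")
```

The rendered text is what would go into the configuration file. Writing it to disk is left out on purpose, since the correct path depends on the runtime setup.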

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

adrianreber commented 5 months ago

/remove-lifecycle stale

mrunalp commented 5 months ago

/stage beta
/milestone v1.30
/label lead-opted-in

sreeram-venkitesh commented 5 months ago

Hello @adrianreber 👋, v1.30 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 9th February 2024.

This enhancement is targeting stage beta for v1.30 (correct me if otherwise).

Here's where this enhancement currently stands:

- [x] KEP readme using the latest template has been merged into the k/enhancements repo.
- [x] KEP status is marked as implementable for latest-milestone: 1.30. KEPs targeting stable will need to be marked as implemented after code PRs are merged and the feature gates are removed.
- [x] KEP readme has up-to-date graduation criteria
- [x] KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here).

Everything is done in https://github.com/kubernetes/enhancements/pull/4288 and https://github.com/kubernetes/enhancements/pull/4305. Please make sure that these PRs are merged before the enhancements freeze.

The status of this enhancement is marked as At risk for enhancement freeze currently. Please make sure your PRs are merged in time.

sreeram-venkitesh commented 4 months ago

Hello 👋, v1.30 Enhancements team here.

Unfortunately, this enhancement did not meet requirements for enhancements freeze.

#4288 is merged, but #4305 is still open. Please file an exception request to get this PR merged.

If you still wish to progress this enhancement in v1.30, please file an exception request. Thanks!

salehsedghpour commented 4 months ago

/milestone clear

HFourier commented 3 months ago

> > Ah, good to know. No, this is not a real bug. We probably at some point need the ability to pass down different parameters from Kubernetes to CRIU. But this is something for the far future. You can also control CRIU options with a CRIU configuration file. Handling TCP connections that are established could be configured there.
>
> Hi again @adrianreber, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:
>
> So I tried to check the log file at /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore:
>
> Restoring pods without TCP connections works fine.

I have the same problem, have you solved it?

Tobeabellwether commented 3 months ago

> > > Ah, good to know. No, this is not a real bug. We probably at some point need the ability to pass down different parameters from Kubernetes to CRIU. But this is something for the far future. You can also control CRIU options with a CRIU configuration file. Handling TCP connections that are established could be configured there.
> >
> > Hi again @adrianreber, when I try to restore the checkpoint of the pod with a TCP connection into a new pod, I encounter a new problem:
> >
> > So I tried to check the log file at /var/run/containers/storage/overlay-containers/order-service/userdata/restore.log, but I only found these folders, none with the name of the container to restore:
> >
> > Restoring pods without TCP connections works fine.
>
> I have the same problem, have you solved it?

I've read CRIU's documentation. I think checkpointing a container while data is flowing in and out is generally tricky and not safe, so I always close its connections before checkpointing.

sreeram-venkitesh commented 1 month ago

Hi @adrianreber 👋, 1.31 Enhancements Lead here.

If you wish to progress this enhancement in v1.31, please have the SIG lead opt-in your enhancement by adding the lead-opted-in label and set the milestone to v1.31 before the Production Readiness Review Freeze.

/remove-label lead-opted-in

kubernetes / enhancements

Forensic Container Checkpointing #2008
