kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0
5.09k stars, 1.01k forks

Add Sandbox API support into Kata Containers #7043

Open studychao opened 1 year ago

studychao commented 1 year ago

This is the umbrella issue for Sandbox API support in Kata Containers. It tracks all related to-dos and discussion.

Background

The containerd community started designing the Sandbox API in 2020 (related issue: https://github.com/containerd/containerd/issues/4131). An experimental Sandbox API implementation now ships with containerd 1.7, and the stable version is expected to ship in containerd 2.0.

The Sandbox API fundamentally changes how we create sandboxes (i.e. VMs), how we create containers, how we calculate VM resources, and many other aspects of Kata Containers. The change is significant enough that we need to act quickly, take a deep look at it, and work out what the Kata community needs in order to help build this API.

What is Sandbox?

[ from containerd ] The sandbox concept has the following properties in relation to containers it hosts:

A sandbox acts as a parent entity for containers, e.g. it starts first and ends last (this is typically useful in (micro)VM environments, where VMs must be started before any other entities). A sandbox acquires the resources needed to run child containers (for instance, Kubernetes creates "pause" containers to acquire an IP address and network namespace for child containers).
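The two properties above can be illustrated with a toy model. This is only a sketch: the `Sandbox` class, its method names, and the `netns-...` string are hypothetical and are not part of containerd or Kata Containers.

```python
class Sandbox:
    """Toy model of a sandbox as the parent entity of its containers."""

    def __init__(self, sandbox_id):
        self.sandbox_id = sandbox_id
        self.running = False
        self.netns = None        # shared resource acquired by the sandbox
        self.containers = set()

    def start(self):
        # The sandbox starts first and acquires shared resources
        # (here just a network namespace name) for its children.
        self.netns = f"netns-{self.sandbox_id}"
        self.running = True

    def add_container(self, container_id):
        if not self.running:
            raise RuntimeError("sandbox must be started before its containers")
        self.containers.add(container_id)
        return self.netns        # children reuse the sandbox's resources

    def remove_container(self, container_id):
        self.containers.discard(container_id)

    def stop(self):
        # The sandbox ends last: refuse to stop while children remain.
        if self.containers:
            raise RuntimeError("stop all containers before the sandbox")
        self.running = False
        self.netns = None
```

Adding a container before `start()` or calling `stop()` while a container is still registered raises `RuntimeError`, which mirrors the "starts first and ends last" ordering.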

What is Sandbox API?

The Sandbox API makes the sandbox a first-class citizen, so sandbox management logic is no longer mixed into the Task API. For example, before the Sandbox API, handling the Task API request CreateContainer could also involve VM creation logic.
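A minimal sketch of that separation, with hypothetical class and method names (`LegacyShim`, `SandboxController`, `TaskService`) that do not correspond to the real containerd or Kata interfaces. The legacy path hides the VM boot inside the first container request, while the split path gives the sandbox its own entry point.

```python
class LegacyShim:
    """Before the Sandbox API: only Task-style requests exist."""

    def __init__(self):
        self.vm_running = False
        self.trace = []

    def create_container(self, container_id):
        # VM creation is hidden inside the first container request.
        if not self.vm_running:
            self.trace.append("boot VM")
            self.vm_running = True
        self.trace.append(f"create container {container_id}")


class SandboxController:
    """With the Sandbox API: sandbox lifecycle has its own entry point."""

    def __init__(self):
        self.vm_running = False
        self.trace = []

    def start_sandbox(self):
        self.trace.append("boot VM")
        self.vm_running = True


class TaskService:
    """Container requests only; no VM creation logic lives here."""

    def __init__(self, controller):
        self.controller = controller

    def create_container(self, container_id):
        if not self.controller.vm_running:
            raise RuntimeError("sandbox not started")
        self.controller.trace.append(f"create container {container_id}")


# Legacy flow: a "pause" container request triggers the VM boot.
legacy = LegacyShim()
legacy.create_container("pause")
legacy.create_container("app")

# Split flow: the sandbox is started explicitly, then containers are created.
controller = SandboxController()
controller.start_sandbox()
TaskService(controller).create_container("app")
```

In the split flow the trace contains no "pause" container at all, because the VM boot is requested directly.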

Advantages it brings:

Proposal

Plan 1: runtime-rs + non-built-in VMM (e.g., Cloud Hypervisor, QEMU, Firecracker)

Stage 1

[Diagram: runtime-rs + non-built-in VMM, stage 1]

Stage 2

[Diagram: runtime-rs + non-built-in VMM]

What's new?

Why 2 stages?

Cons:

  1. The stability of the sandboxer process affects every Kata container on the node, so it must be highly reliable.

Plan 2: runtime-rs + built-in VMM Dragonball


What's new?

Pros:

  1. We no longer need to rely on a single node-level sandboxer process to control the lifecycle of Kata containers, an approach whose goal is to reduce the number of processes. The 1:N sandboxer model would significantly increase operational and maintenance complexity, and a problem in the sandboxer could affect the state of every container on the node. The sandboxer process would also require restart-recovery and hot-upgrade mechanisms, which carry risk in production environments.
  2. The sandbox control path and container control path are separated and clearly defined.
  3. No more pause container.
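The failure-domain argument in the first point can be made concrete with a small sketch. The pod names, process names, and the `affected_pods` helper are hypothetical and only model which pods depend on which managing process.

```python
def affected_pods(pod_to_process, failed_process):
    """Return the pods whose managing process is the one that failed."""
    return sorted(
        pod for pod, proc in pod_to_process.items() if proc == failed_process
    )


pods = ["pod-a", "pod-b", "pod-c"]

# 1:N model: one sandboxer process manages every pod on the node.
one_to_n = {pod: "sandboxer" for pod in pods}

# 1:1 model: each pod has its own managing process.
one_to_one = {pod: f"shim-{pod}" for pod in pods}

all_hit = affected_pods(one_to_n, "sandboxer")
one_hit = affected_pods(one_to_one, "shim-pod-b")
```

Here `all_hit` lists all three pods, while `one_hit` contains only `pod-b`, which is the single-point-of-failure concern in miniature.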

We already have simple proof-of-concept (POC) code that demonstrates the feasibility of this plan: https://github.com/wllenyj/kata-containers/commit/f5b62a2d7c728d1b260afb10d9df144640d27a01

Actions

fidencio commented 2 months ago

/cc @littlejawa @gkurz

One thing that comes to my mind here is how this will affect CRI-O support, considering that the sandbox API is a "containerd-only" feature.

lifupan commented 2 months ago

Hi @fidencio

I think we had better keep Kata supporting both the Sandbox API and the non-Sandbox API, so that CRI-O and containerd (including older containerd versions without Sandbox API support) are all supported.

fidencio commented 2 months ago

> I think we had better keep Kata supporting both the Sandbox API and the non-Sandbox API, so that CRI-O and containerd (including older containerd versions without Sandbox API support) are all supported.

This makes a lot of sense to me :-)

littlejawa commented 2 months ago

I will have to dig deeper, but as I understand it, the "Sandbox API" is an evolution of the ShimV2 interface, which CRI-O already uses. At a high level, it looks like it could be supported on the CRI-O side with changes to API calls, without big changes to the structure of CRI-O (as we already have a dedicated ShimV2 code path). But I definitely need more time to look into it, so supporting both APIs, at least for some time, is welcome. I'll keep an eye on all that. Thanks for pinging me here @fidencio :)

amshinde commented 3 weeks ago

@studychao @lifupan Just taking a look at this. While the Sandbox API does provide improvements in terms of getting rid of the pause process and having separate control paths for the sandbox and containers, I am a bit apprehensive about the Stage 2 plan that you proposed for standalone hypervisors. Having a single sandboxer process means a single point of failure for all the pods on that node. In addition, any signals sent to the shim today apply only to that pod, which allows a degree of control that would be lost with the sandboxer process. This would not be an issue with Dragonball.

Perhaps we should reconsider this and discuss it in the Kata AC, if that has not already been done. We could consider implementing just phase 1 for standalone hypervisors.

lifupan commented 3 weeks ago

Hi @amshinde

Yes, you are right. Currently, the focus is on the first phase. There is no specific plan for the second phase. Even if the second phase is to be done, it will be discussed with everyone at the AC meeting in advance.

amshinde commented 3 weeks ago

@lifupan Thanks for confirming.