Add Sandbox API support into Kata Containers

studychao commented 1 year ago

This is the umbrella issue for Sandbox API support in Kata. All related to-dos and discussion will be tracked here.

Background

The containerd community has been designing the Sandbox API since 2020 (related issue: https://github.com/containerd/containerd/issues/4131). An experimental Sandbox API implementation ships with containerd 1.7, and the stable version is expected to ship with containerd 2.0.

The Sandbox API changes how we create sandboxes (i.e. VMs), how we create containers, how we calculate VM resources, and many other aspects of Kata Containers. The change is significant enough that we should act quickly, take a deep look at it, and work out what the Kata community needs to contribute to building this API.

What is Sandbox?

[ from containerd ] The sandbox concept has the following properties in relation to containers it hosts:

Sandbox acts as a parent entity for containers, e.g. it starts first and ends last (this is typically useful in (micro)VM environments, where VMs are required to be started before any other entities).
Sandbox acquires resources needed for running child containers (for instance Kubernetes creates "pause" containers to acquire IP and network namespace for child containers).

What is Sandbox API?

The Sandbox API makes the sandbox a first-class citizen, so sandbox management logic is no longer mixed into the Task API. For example, before the Sandbox API, handling the Task API request CreateContainer could also involve VM creation logic.

Advantages it brings:

No more intertwined logic with containers and sandbox (VM) management.
No more redundant pause container in Kata Containers when using Sandbox API.
Quicker startup time and lower memory consumption thanks to the simplified architecture.
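
The difference can be illustrated with a minimal sketch. It is written in Python only for brevity (runtime-rs itself is Rust), and every name in it (`LegacyShim`, `SandboxService`, `TaskService`, `boot_vm`) is hypothetical rather than a containerd or Kata API. It contrasts a Task-API-only flow, where the first create-container request (assumed here to be the pause container) implicitly boots the VM, with a Sandbox API flow, where the VM lifecycle has its own entry point.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """Stand-in for a sandbox VM; a real runtime would launch a hypervisor."""
    sandbox_id: str
    containers: list = field(default_factory=list)


def boot_vm(sandbox_id: str) -> VM:
    # Hypothetical helper: pretend to start the VM for this sandbox.
    return VM(sandbox_id)


class LegacyShim:
    """Task API only: sandbox management is hidden inside container creation."""

    def __init__(self):
        self.vm = None

    def create_container(self, sandbox_id: str, container_id: str) -> VM:
        if self.vm is None:
            # The first request (the pause container) also creates the VM.
            self.vm = boot_vm(sandbox_id)
        self.vm.containers.append(container_id)
        return self.vm


class SandboxService:
    """Sandbox API: the VM lifecycle is a first-class operation."""

    def __init__(self):
        self.sandboxes = {}

    def create_sandbox(self, sandbox_id: str) -> VM:
        self.sandboxes[sandbox_id] = boot_vm(sandbox_id)
        return self.sandboxes[sandbox_id]


class TaskService:
    """Container operations only; the sandbox must already exist."""

    def __init__(self, sandbox_service: SandboxService):
        self.sandbox_service = sandbox_service

    def create_container(self, sandbox_id: str, container_id: str) -> VM:
        # Raises KeyError if the sandbox was never created.
        vm = self.sandbox_service.sandboxes[sandbox_id]
        vm.containers.append(container_id)
        return vm


# Legacy flow: a pause container is needed to bring the sandbox up.
legacy = LegacyShim()
legacy.create_container("pod-a", "pause")
legacy_vm = legacy.create_container("pod-a", "app")

# Sandbox API flow: the sandbox is created explicitly, so no pause container.
sandbox_service = SandboxService()
task_service = TaskService(sandbox_service)
sandbox_service.create_sandbox("pod-a")
new_vm = task_service.create_container("pod-a", "app")
```

In the second flow the task service never decides whether a VM has to be started, which is the separation of sandbox and container logic listed above.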

Proposal

Plan 1 : runtime-rs + non built-in VMM ( e.g., Cloud Hypervisor, QEMU, Firecracker)

Stage 1

[Figure: runtime-rs + non-built-in VMM, stage 1]

Stage 2

[Figure: runtime-rs + non-built-in VMM]

What's new?

Introduce the Sandboxer (temporary name) process, which manages all containers on the node in a 1:N model and replaces the shim process that existed for each pod in the 1:1 model. As a result, only one process remains per pod.
Using the plugin service architecture in runtime-rs, we can introduce a Sandbox Service in the sandboxer process to handle sandbox-related management and a Task Service in the sandboxer process to handle container-related management.
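
A rough sketch of the 1:N model, in Python for brevity and with hypothetical names throughout (`Sandboxer`, `SandboxService`, `TaskService`); it does not reflect the actual runtime-rs plugin interfaces. One node-level object owns the state of every pod, and two services registered in it split sandbox management from container management.

```python
class SandboxService:
    """Handles the sandbox (VM) lifecycle for every pod on the node."""

    def __init__(self, state):
        self.state = state  # shared map: sandbox id -> list of container ids

    def create(self, sandbox_id):
        if sandbox_id in self.state:
            raise ValueError(f"sandbox {sandbox_id} already exists")
        self.state[sandbox_id] = []

    def stop(self, sandbox_id):
        # Stopping a sandbox drops its containers with it.
        del self.state[sandbox_id]


class TaskService:
    """Handles container operations inside already-created sandboxes."""

    def __init__(self, state):
        self.state = state

    def create(self, sandbox_id, container_id):
        self.state[sandbox_id].append(container_id)


class Sandboxer:
    """Single node-level process (1:N) hosting both services as plugins."""

    def __init__(self):
        self.state = {}
        self.services = {
            "sandbox": SandboxService(self.state),
            "task": TaskService(self.state),
        }

    def service(self, name):
        return self.services[name]


# One sandboxer serves several pods; there is no per-pod shim object.
sandboxer = Sandboxer()
for pod in ("pod-a", "pod-b"):
    sandboxer.service("sandbox").create(pod)
sandboxer.service("task").create("pod-a", "app")
sandboxer.service("task").create("pod-b", "db")
sandboxer.service("sandbox").stop("pod-b")
```

Because every pod's state lives in this one object, its reliability matters for the whole node.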

Why 2 stages?

IO for containers inside the guest currently requires the shim to send an RPC call to the agent to get the IO stream, which the shim then forwards to containerd. We therefore still need the shim for IO operations, which is why stage 1 exists.
#6714 could solve this problem by letting IO operations bypass the shim. Once #6714 is resolved and closed, we can switch to the cleaner stage 2 architecture with a 1:N model.

Pros:

1. The sandbox control path and container control path are separated and clearly defined.
2. Lower memory consumption and quicker startup time, since there is no more shim process and no more pause containers.

Cons:

The stability of the sandboxer process affects all the Kata containers on the node. It must be really reliable.

Plan 2 : runtime-rs + built-in VMM Dragonball

What's new?

In the runtime-rs + Dragonball architecture, the VMM is already embedded and there is only one process for Kata, so we don't need a component like the Sandboxer to reduce the number of processes. We implement the task service and sandbox service directly in the kata shim.

Pros:

We no longer need to rely on a single sandboxer process on the node to control the lifecycle of Kata containers in order to reduce the number of processes. The 1:N sandboxer design would significantly increase the complexity of operations and maintenance, and if the sandboxer encounters problems, it may affect the status of all containers on the node. The sandboxer process would also require restart-recovery and hot-upgrade mechanisms, which pose risks in production environments.
The sandbox control path and container control path are separated and clearly defined.
No more pause container.

We already have simple POC code that demonstrates the feasibility of this plan; please check it here: https://github.com/wllenyj/kata-containers/commit/f5b62a2d7c728d1b260afb10d9df144640d27a01

Actions

[ ] Send out the RFC Pull Request for plan 2 (runtime-rs + built-in VMM Dragonball)

fidencio commented 2 months ago

/cc @littlejawa @gkurz

One thing that comes to my mind here is how this will affect CRI-O support, considering that the sandbox API is a "containerd-only" feature.

lifupan commented 2 months ago

Hi @fidencio

I think we'd better keep Kata supporting both the Sandbox API and the non-Sandbox-API path, so that CRI-O and containerd (including older containerd without Sandbox API support) are all supported.

fidencio commented 2 months ago

> I think we'd better keep Kata supporting both the Sandbox API and the non-Sandbox-API path, so that CRI-O and containerd (including older containerd without Sandbox API support) are all supported.

This makes a lot of sense to me :-)

littlejawa commented 2 months ago

I will have to dig deeper, but as I understand it, the "Sandbox API" is an evolution of the ShimV2 interface, which crio already uses. At a high level, it looks like it could be supported on the crio side with changes to API calls, without big changes to the structure of crio (as we already have a dedicated shimV2 code path). But I definitely need more time to look into it, so supporting both APIs at least for some time is welcome. I'll keep an eye on all that. Thanks for pinging me here @fidencio :)

amshinde commented 3 weeks ago

@studychao @lifupan Just taking a look at this. While the Sandbox API does provide improvements in terms of getting rid of the pause process and having a separate control path for the sandbox and containers, I am a bit apprehensive about the Stage 2 plan that you proposed for standalone hypervisors. Having a single sandboxer process means a single point of failure for all the pods on that node. Plus, any signals sent to the shim today are for that pod itself, allowing some control which will be lost with the sandboxer process. This would not be an issue with Dragonball.

Perhaps we should reconsider this and discuss it in the Kata AC, if that has not already been done. We could consider implementing just phase 1 for standalone hypervisors.

lifupan commented 3 weeks ago

Hi @amshinde

Yes, you are right. Currently, the focus is on the first phase. There is no specific plan for the second phase. Even if the second phase is to be done, it will be discussed with everyone at the AC meeting in advance.

amshinde commented 3 weeks ago

@lifupan Thanks for confirming.
