kata-containers / tests

Kata Containers tests, CI, and metrics
https://katacontainers.io/
Apache License 2.0
140 stars 196 forks source link

CCv0: Discussion options for removing the CC branch and running confidential-containers out of kata-containers `main` #4441

Open stevenhorsman opened 2 years ago

stevenhorsman commented 2 years ago

For the confidential containers project our end goal was always to merge code into the main branches of kata-containers and tests, but whilst we were doing PoCs in CCv0 we decided the simplest approach was to work on a separate branch. As we entered the 'CCv1' roadmap we thought that that fact we were using a containerd fork blocked this, but as the burden of managing two development branches is continuing we want to give it some attention and work out what the options are and what is viable.

Here is the updated plan from my perspective, but it might well be wrong, or incomplete, so other input is welcome:

Our favourite suggestion so far is to have all the content for 'normal' kata and 'CC' kata coming from the same main branch, but provide a build config to create different binaries for the different versions.

Description

wainersm commented 2 years ago

Hi @stevenhorsman , just brainstorming here...maybe we don't need to duplicate the CI jobs. Keep the current Kata jobs building and testing the "normal" Kata Containers; then add just one new job for CC specific tests.

Suppose that the new job configuration is called CC_CRI_CONTAINERD then basically we need this:

diff --git a/.ci/ci_job_flags.sh b/.ci/ci_job_flags.sh
index 0eb255dc..eb8202b8 100755
--- a/.ci/ci_job_flags.sh
+++ b/.ci/ci_job_flags.sh
@@ -93,13 +93,21 @@ case "${CI_JOB}" in
        export KATA_HYPERVISOR="qemu"
        export KUBERNETES="no"
        ;;
-"CRI_CONTAINERD"|"CRI_CONTAINERD_K8S")
+"CRI_CONTAINERD"|"CRI_CONTAINERD_K8S"|"CC_CRI_CONTAINERD")
        # This job only tests containerd + k8s
        init_ci_flags
        export CRI_CONTAINERD="yes"
        export CRI_RUNTIME="containerd"
        export KATA_HYPERVISOR="qemu"
-       [ "${CI_JOB}" == "CRI_CONTAINERD_K8S" ] && export KUBERNETES="yes"
+       case "${CI_JOB}" in
+               "CRI_CONTAINERD_K8S")
+                       export KUBERNETES="yes"
+                       ;;
+               "CC_CRI_CONTAINERD")
+                       # Export any CC specific environment variables
+                       export CC="yes"
+                       ;;
+       esac
        ;;
 "CRI_CONTAINERD_K8S_INITRD")
        # This job tests initrd image + containerd + k8s
diff --git a/.ci/run.sh b/.ci/run.sh
index 8d27f4c8..4580e297 100755
--- a/.ci/run.sh
+++ b/.ci/run.sh
@@ -55,6 +55,9 @@ case "${CI_JOB}" in
                echo "INFO: Running ksm test"
                sudo -E PATH="$PATH" CRI_RUNTIME="containerd" bash -c "make ksm"
                ;;
+       "CC_CRI_CONTAINERD")
+               echo "INFO: Running Confidential Container tests"
+               sudo -E PATH="$PATH" CRI_RUNTIME="containerd" bash -c "make cc-tests"
        "CRI_CONTAINERD_K8S_COMPLETE")
                echo "INFO: Running e2e kubernetes tests"
                sudo -E PATH="$PATH" CRI_RUNTIME="containerd" bash -c "make kubernetes-e2e"

The makefile target cc-tests (just an example, ok?) will call the CC specific integration tests.

stevenhorsman commented 2 years ago

Hi @stevenhorsman , just brainstorming here...maybe we don't need to duplicate the CI jobs. Keep the current Kata jobs building and testing the "normal" Kata Containers; then add just one new job for CC specific tests.

Sorry, I'm probably missing something obvious, but how do we test all ~12 os/cri/hypervisor combinations that Kata supports in this way if we only add 1 new job config?

wainersm commented 2 years ago

Hi @stevenhorsman , just brainstorming here...maybe we don't need to duplicate the CI jobs. Keep the current Kata jobs building and testing the "normal" Kata Containers; then add just one new job for CC specific tests.

Sorry, I'm probably missing something obvious, but how do we test all ~12 os/cri/hypervisor combinations that Kata supports in this way if we only add 1 new job config?

It was just an example. The way that Kata CI is designed, one need to created entries on .ci/ci_job_flags.sh corresponding to the combinations of CRI/Hypervisors to be tested. The .ci/run.sh script contain the instructions to run the suite of tests for each combination. And finally we mix that with different Jenkins jobs, resulting on the various combinations of os/cri/hypervisor we want to test.

stevenhorsman commented 2 years ago

Update: on the Architecture Committee meeting today (15th Feb) we brought up the issue of getting confidential containers code merged into main. My take on it was that everyone was generally on board and we didn't get push-back on the idea of merging it in general (we pitched having a CC build config and using that to drive the differences in the build e.g. whether the attestation-agent gets included in the rootfs). The main question is still about the containerd piece. Our best idea (from Fidencio and Samuel) so far is to re-name the confidential-containers fork of containerd to have a different module name e.g. containerd-cc or something, so then we could pull in both copies of the code into the vendor directory. We could then create a containerd wrapper layer than switches between the main implementation and cc implementation based on the build parameter (similar to an #ifdef). My golang is very basic, but will a quick google, it looks like it might be possible with go build tags. I think the next step is to try a PoC and see how it works.

stevenhorsman commented 2 years ago

Wainer has also suggested that go plugins might be able to help as an alternative to build tags

stevenhorsman commented 2 years ago

I think the single release option is no longer being considered, so I've removed it from the description for clarity, but I'll paste it here for reference:

Single release

Description

Pros:

Cons:

Common issues:

Other

Questions

jodh-intel commented 2 years ago

Thanks for raising this @stevenhorsman.

A few thoughts and questions. Note: these are not directed at you, but at the entire community...

Assumptions

ifdef build solutions

If any sort of #ifdef plan is required, we're going to have to be extremely careful to avoid packaging mistakes and to ensure we can discriminate between the two versions. As a mimimum we will need both containerd-shim-kata-v2 --version and the announce function to log details of how that version of the runtime has been built.

Containerd fork

It would be useful if we could summarise:

Avoid confusion

How is all this going to be documented?

It's seems odd to document CC functionality in Kata repos, but if that's where the code lives, we will have to atleast create a set of links back to the actual docs in an appropriate CC repo.

Unless handled carefully, it's going to be rather unclear to users and contributors why there are two separate GitHub organisations imho.

Merge periodicity

Is this a "one off" merge or is there a possibility of further CC code syncs into Kata?

If there is a possibility of this happening (and I can imagine there is), we need to consider some sort of merge window rules now. For example, we don't want to land a lot of CC changes in Kata close to the date of new Kata releases (even if the CC code isn't being actively used by Kata).

Maybe something as simple as this would suffice?:

We could also consider something like this:

/cc @kata-containers/architecture-committee for further input.

stevenhorsman commented 2 years ago

Thanks so much for the feedback @jodh-intel - I really appreciate the feedback and this is something that we all have to do together and be happy with if we are going to be successful.

In the Kata CC meeting, I took an action to break this issue down into separate issues for different components, but I think it is worth having the discussion here first as it might get some of the higher-level topics dealt with first before we worry about the specific code merges. I definitely don't have all the answers and as you've said it's a discussion for the whole community, but I'll give my initial reaction to some of the questions raised.

Assumptions

  • Minimal impact on "vanilla Kata":

    • User experience: Kata will continue to work as it does today with no additional effort required on the behalf of users/admins.
    • Performance/metrics: The overall system performance using the default Kata configuration will not be negatively impacted by these changes.

I agree. I think we'd prefer to minimise the '#ifdef'-ing of code (apart from in the containerd section which we can't currently avoid). This would mean that we'd just have a single version of the agent which would pull into the extra crates required for image offload. I hope that the footprint of these won't be too large, but it's something we'll want to work with some performance specialists to check before we merged anything.

ifdef build solutions

If any sort of #ifdef plan is required, we're going to have to be extremely careful to avoid packaging mistakes and to ensure we can discriminate between the two versions. As a mimimum we will need both containerd-shim-kata-v2 --version and the announce function to log details of how that version of the runtime has been built.

Yep - if we can restrict the go build tag/#ifdef type behaviour to the containerd-shim-kata-v2 and separate all the CC specific code into separate file(s), then one of the things we'll have to put on the list is to have some type of cc discriminator in the version.

Containerd fork

I'm not an expert on all the details, so @wllenyj will be able to give more details, but here is my basic understanding:

It would be useful if we could summarise:

  • Why a containerd fork is needed.

I think the simple answer here is that we need to modify the containerd code to enable us to pass the pull_image request through the shim and to the agent rather than being run on the host, in order to enable image offload.

  • How the fork differs to upstream.

I think the approach that the Ali team have taken is to create a new image service and pull image command in the shim and implemented a CC CRI service that handles the routing of the commands. I'm sure this is probably not completely accurate. For reference the full changes can be seen at: https://github.com/confidential-containers/containerd/compare/main...confidential-containers:CC-main

  • How we plan to maintain that fork (wrt security fixes, etc).

My current thinking is that we keep the CC fork of containerd tightly aligned with the version that 'vanilla Kata' is using, so when Kata updates we will rebase the CC fork on that version. That minimises the code differences in Kata and tries to ensure we don't get left too far behind.

  • How long we are comfortable maintaining such a fork (it's potentially a huge cost).
  • What the stumbling blocks and/or timeline for getting the fork changes merged upstream.

That's the $64,000 question. We initially hoped that containerd would have accepted our changes, or at least ideas by now and avoid this problem, but from the discussions that Samuel et al. have had with Derek and co. it seems like they have a grander to re-architect containerd to provide the ability to offload most of the containerd services when it starts. This is way beyond the scope of what we need, but means that they are not going to merge our fork and their new architecture won't be ready in the short term (my impression is we are looking at closer to 12 months than 1, or 2, but I'm definitely not 'in the know').

That's how long we are might have to maintain the fork, which we are pretty much committed to for CC, as we don't have a solution without it, but it's up to the wider Kata community to decide if they are comfortable with this.

Avoid confusion

How is all this going to be documented?

It's seems odd to document CC functionality in Kata repos, but if that's where the code lives, we will have to atleast create a set of links back to the actual docs in an appropriate CC repo.

Unless handled carefully, it's going to be rather unclear to users and contributors why there are two separate GitHub organisations imho.

At the moment, some of the limited doc we have is in the kata CCv0 fork, but as we start to pull in more repos from the confidential-containers org it might makes sense for it to migrate to that org where Kata is part of the wider 'CC' solution and then just sign post to it from the Kata repositories

Merge periodicity

Is this a "one off" merge or is there a possibility of further CC code syncs into Kata?

This depends on the Kata Architecture committee and wider community, but the way I would like to see it is that we'd have a few smaller scoped PRs to get the existing CCv0 code into main and once that is done we'd delete and 'CC' branches and work directly against main from then on integrating any 'CC' code directly rather than going via a branch. I think the aim of confidential-containers is to pull in the upstream versions of all the components we used rather than having our own forks/branches. Obviously we've come unstuck with that with containerd at the moment, but it's the only viable way to proceed long term as far as I can see without ending up with a lot of duplicated effort.

As I said - this is my own thoughts and probably not representative of the CC community and people who are much smarter and more experienced in this than me. Maybe @sameo @fitzthum @magowan, or @ariel-adam can elaborate further and correct me?

jodh-intel commented 2 years ago

Thanks @stevenhorsman. A few more questions:

containerd fork

fwics the ideal would be for upstream containerd to allow us to configure it to:

Although these features wouldn't generally be useful to most, they don't look that contentious. Is the problem then that containerd don't want us to spend time on a containerd feature that would (have to) be dropped when their new architecture lands? Is there a middle ground I wonder where we could create a new containerd plugin for the features we'd like? It looks like containerd already supports these but maybe the scope of that interface might need to be expanded to support the features we need?

Short-term alternatives?

Random thought: containerd is but one container manager. If we are going to have to wait for the upstream grand redesign work to be finished, to avoid having to maintain a fork of containerd ourselves (and all the costs and complications that brings with it), could we use an alternative such as podman or even Docker (once we have a viable shimv2 "wrapper") as a stop-gap solution?

Or are there other specific containerd features we need for Kata+CC?

/cc @c3d, @dgibson.

fidencio commented 2 years ago

@stevenhorsman, I'd explicitly add the need of having a kata-deploy image being shipped for the CC specific effort.

c3d commented 2 years ago

@jodh-intel For them, there is also the problem of having to maintain a specific feature they can't really test, and that also conflicts with their changes in progress.

jodh-intel commented 2 years ago

+1 to @fidencio's idea of having separate kata-deploy images. For maximum safety, we'd add a few checks somewhere to assert (positively and negatively) what we expect (and don't expect) to find in the Kata and CC output of kata-deploy.

If we end up having to use the "ifdef CC" build approach, we'll also need a similar set of checks for the CI (ensure-kata.sh and ensure-cc.sh type tests).

Metrics

As discussed in the AC meeting earlier today, we can "suck it and see" by raising a PR and just reviewing what the metrics CI thinks of it. If we have a single image that includes the CC elements, we may need to adjust the thresholds but if we are talking about separate images, we might need a separate set of CC metrics thresholds.

/cc @GabyCT who might have thoughts on this.

GabyCT commented 2 years ago

@jodh-intel currently for CC there is not metrics CI, in case that you are testing with CC elements you will need a new baremetal to test and enable that CI as probably the thresholds will be different

stevenhorsman commented 2 years ago

Whilst the conversation is on-going I've started the process of breaking down this into separate issues for different areas that we might need to separate and merge. This a still WIP and likely to change before we can run with them:

wainersm commented 2 years ago

Hi @stevenhorsman,

Whilst the conversation is on-going I've started the process of breaking down this into separate issues for different areas that we might need to separate and merge. This a still WIP and likely to change before we can run with them:

* [ci: Add confidential-containers awareness to the build/test jobs #4626](https://github.com/kata-containers/tests/issues/4626)

* [ci: Add support for installing the confidential-containers fork of containerd for integration testing #4627](https://github.com/kata-containers/tests/issues/4627)

* [agent: Merge confidential containers agent/guest changes into main branch kata-containers#3971](https://github.com/kata-containers/kata-containers/issues/3971)

Am I wrong assuming that https://github.com/kata-containers/kata-containers/issues/3971 can be the first task we should focus on?

IIUC merging the CC agent/guest into main doesn't break the non-CC use case of Kata, although CC specific features (e.g. pull image inside the guest) won't be working until the runtime and containerd are changed.

The current chain of dependencies seems to be:

https://github.com/confidential-containers/image-rs/pull/5 -> (release image-rs x.y so we stop pulling from main?) -> https://github.com/kata-containers/kata-containers/issues/3970 -> https://github.com/kata-containers/kata-containers/issues/3971 -> #4626 -> #4627 -> #4628

* [runtime: Add support for the confidential-containers changes #4628](https://github.com/kata-containers/tests/issues/4628) 
stevenhorsman commented 2 years ago

@wainersm - Firstly, thanks for taking a look at the issues and giving feedback - it's really appreciated. I also have a confession that once I read your flow I realised that I should have split out merging the runtime and integration tests into separate issues for better clarity, so have both https://github.com/kata-containers/kata-containers/issues/3996 and https://github.com/kata-containers/tests/issues/4628 now. Sorry if that makes you comments make less sense

Am I wrong assuming that kata-containers/kata-containers#3971 can be the first task we should focus on?

IIUC merging the CC agent/guest into main doesn't break the non-CC use case of Kata, although CC specific features (e.g. pull image inside the guest) won't be working until the runtime and containerd are changed.

The current chain of dependencies seems to be:

confidential-containers/image-rs#5 -> (release image-rs x.y so we stop pulling from main?) -> kata-containers/kata-containers#3970 -> kata-containers/kata-containers#3971 -> #4626 -> #4627 -> #4628

Sort of. I agree on the dependencies: confidential-containers/image-rs#5 -> (release image-rs x.y so we stop pulling from main?) -> kata-containers/kata-containers#3970 -> kata-containers/kata-containers#3971 -> kata-containers/kata-containers#3996 -> #4628 (also dependant on #4627)

However my thinking (which might be flawed) is that we can make progress on #4626, then #4627 before the agent and runtime changes are merged. If we did this then they would be testing regression scenarios from 'vanilla' Kata with the forked containerd installed, rather than new 'confidential-containers' integration tests, but I think there is still value in this. Also have #4626 might be useful for us if we want to try out only adding the attesation-agent to the guest image if we are creating the 'CC' build (which might be required depending on how much it bloats the image.)

I hope that splitting the flow into two slightly parallel streams helps speed up our progress, but there might well be something I've failed to take into account?

wainersm commented 2 years ago

@stevenhorsman I think your comments on https://github.com/kata-containers/tests/issues/4441#issuecomment-1082789573 made the plan very clear to me. IMO that's detailed enough to allow us kick off some changes on main. Thus, I assigned to myself the issues #4626 , #4627 , and #4628; but if you (or someone of your team) had the intention to work on those issues then I am glad to re-assign them. :)

wainersm commented 2 years ago

Linking this issue with kata-deploy: Provide a kata-deploy image to be used with the Confidential Containers operator

stevenhorsman commented 2 years ago

@stevenhorsman I think your comments on #4441 (comment) made the plan very clear to me. IMO that's detailed enough to allow us kick off some changes on main. Thus, I assigned to myself the issues #4626 , #4627 , and #4628; but if you (or someone of your team) had the intention to work on those issues then I am glad to re-assign them. :)

@wainersm - I was thinking of taking #4627 as it's mostly just a port of the code I wrote in CCv0, so would simplify the DCO sign-off. I think #4628 is obviously a lot larger and will require us to sync up depending on when it is unblocked, but I'm (or someone on my team) happy to be involved in that and #4626 if there is anything we can help with!

wainersm commented 2 years ago

@stevenhorsman I think your comments on #4441 (comment) made the plan very clear to me. IMO that's detailed enough to allow us kick off some changes on main. Thus, I assigned to myself the issues #4626 , #4627 , and #4628; but if you (or someone of your team) had the intention to work on those issues then I am glad to re-assign them. :)

@wainersm - I was thinking of taking #4627 as it's mostly just a port of the code I wrote in CCv0, so would simplify the DCO sign-off. I think #4628 is obviously a lot larger and will require us to sync up depending on when it is unblocked, but I'm (or someone on my team) happy to be involved in that and #4626 if there is anything we can help with!

I just reassigned #4627 to you as it makes complete sense. I will try to work on #4626 first. Let's see how it goes.

stevenhorsman commented 2 years ago

So just an update to the merge of CCv0 into main, based on a discussion we had at the 21st April CC meeting: Given what we heard from Derek MacGowan about the 1.7 release dates we agreed that rather than push ahead with trying to get support for our confidential-containers fork of containerd merged into main immediately, we would wait until the 1.7 beta was available at the end of next month. At this point we will try switching to that and dropping our fork (this might take a while as I believe there are some complications with rust ttrpc supporting streaming?). The timeframe for the 1.7/2.0 release in October-ish is a little longer than we'd like and we know at this point that we also want the Kata main branch to adopt the new version of containerd, but we felt like trying to get this dual vendor support merged would be very difficult with a new version on containerd that would supercede it coming up soon. Obviously once we learn more information about the release dates and plans and the new code this might change, but the community felt that investing much time in the dual-vendoring approach wasn't the best move currently. With respect to the guest/agent changes, everything we've heard makes us believe the new containerd code won't change anything beyond the shim, so we can proceed with trying to get agent changes merged, once we've removed skopeo and umoci.

wainersm commented 2 years ago

Thanks for the details update @stevenhorsman ! I wasn't able to join that meeting but I think the decision the community took was the best one.

stevenhorsman commented 1 year ago

So to rehydrate this discussion 7 months on with a few thing changing I think we have the following high-level items to resolve/split up and merge before we can completely be free of the CCv0 branch:

I'm sure there are some other items, so keep me honest and I'll try and update this list. As we get more uniform agreement we can covert them into issues.

wainersm commented 1 year ago

Hi @stevenhorsman , so minor changes that came to my mind while in the meeting today:

stevenhorsman commented 1 year ago

Hi @stevenhorsman , so minor changes that came to my mind while in the meeting today:

  • Operator e2e CI should be changed to use tests main branch -- one line change
  • The workflows that push cc-payload should be changed too -- hopefully a couple of one line changes

Thanks, I'll update the list above to try and reflect this

surajssd commented 1 year ago

Migrate over to the upstream containerd, when a (beta?) release is available and update the kata-runtime code to be compatible with the https://github.com/containerd/containerd/pull/7320. Then get this version of containerd adopted in main before we merge the image service related code.

@stevenhorsman is there an issue in the kata repo where the work to migrate from the forked containerd to upstream containerd is being tracked?

stevenhorsman commented 1 year ago

Migrate over to the upstream containerd, when a (beta?) release is available and update the kata-runtime code to be compatible with the containerd/containerd#7320. Then get this version of containerd adopted in main before we merge the image service related code.

@stevenhorsman is there an issue in the kata repo where the work to migrate from the forked containerd to upstream containerd is being tracked?

Hey Suraj, not that I'm aware of. I can start one tomorrow if that's helpful, or anyone else can for that matter and I'll link to it in the above list.

surajssd commented 1 year ago

@stevenhorsman https://github.com/kata-containers/kata-containers/issues/6067 created an issue.

I don't have a lot of context on what needs to be done there other than just switching from the forked version to the latest beta. But I have created the tracking issue. Also that issue needs to have the label area/confidential-containers.