Propose a Shim in Rust (and one more thing)

kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/

Apache License 2.0

5.53k stars 1.06k forks source link

Propose a Shim in Rust (and one more thing) #3052

Closed gnawux closed 2 years ago

gnawux commented 2 years ago

Hello folks,

I am back:) (lol I have never left actually). I filed this issue as we discussed in the AC meeting this week.

In short, what I am proposing:

The shim-v2 implemented in Rust; and
An optional built-in sandboxing feature in the shim.
We'd like to begin to submit PRs in the following weeks and complete this feature in 6 months (before the Berlin summit)
We’d like to consider this as a part of the features of version 3.0, and this issue is not suggesting any time schedule of dropping the go-shim, and I think that should be decided by the community later.

Now the detail:

Since we introduced the rust-agent in late 2019 and made it default in Kata 2.0, there are voices to re-write the shim in Rust from many contributors, e.g. #1071. In Ant Group (and our collaborators in Alibaba), we received the same requirement in order to save the host memory, especially for the high-density case.

In the past year, my teammate Fupan(@lifupan), the original author of shim-v2( kata-containers/runtime#572 ), and other contributors from our team and Alibaba Cloud team have eventually implemented a Rust version shim and made it pass the same test cases of the existing Go version.

In the meanwhile, as we have already had Rust-VMM, which has been proven works by Cloud-Hypervisor and Firecracker, and in our deployments, most of the Kata pods require only the basic sandboxing functionalities. It is reasonable to have a simple built-in sandboxing function in it and we did this eventually (most of the rust-vmm related works are contributed by the Alibaba Cloud team actually, who contributed to the rust-vmm and cloud-hypervisor as well). For complicated or sensitive cases, we may still use the Qemu or Cloud-Hypervisor like before.

The rust shim has been deployed in Alibaba and Ant Group's kubernetes clusters this year. As a result, we saved massive memory in the host and simplified the operating. And we believe it should come back to the upstream so that we can get further benefit from the community development and keep our stack up-to-date.

zanetworker commented 2 years ago

@gnawux It would be great if you can also describe your migration path/story from 2.0 to 3.0 given that you were also using 2.0 in production I assume? If there were any migration tools that helped, also how did this affect your SLOs (did those change?), finally, how does it further help/improve Observability/Debuggability.

fidencio commented 2 years ago

@gnawux, just in order to have things as clear as possible, I'd like to ask whether you could split this into 2 issues:

The shim-v2 implemented in Rust
An optional built-in sandboxing feature in the shim

The first one is, AFAIU, what you're proposing for Berlim's summit.

The "optional built-in sandboxing feature in the shim" is the "shim+VMM all-in-one" binary, right? So, that can be a discussion we could and should treat in a different issue, as that's also quite huge and equally important.

Would that be okay?

gnawux commented 2 years ago

@zanetworker will check with team and summary them.

gnawux commented 2 years ago

@fidencio This is the kick-off issue of the whole thing and we will raise detailed issues and PRs. And In my plan, we could have the both parts in Berlin if everything went correct. However, if there were anything went wrong, we might adjust the roadmap :)

fidencio commented 2 years ago

@fidencio This is the kick-off issue of the whole thing and we will raise detailed issues and PRs. And In my plan, we could have the both parts in Berlin if everything went correct. However, if there were anything went wrong, we might adjust the roadmap :)

Okay, but I really would like to ask to have the issues raised way before the PRs, so we have enough time to discuss the issues (on whatever will need to be discussed). I am assuming then that you'll use this issue as a tracker issue for the others.

c3d commented 2 years ago

@gnawux Here are a few ideas and comments

The description seems to indicate that the shim reimplementation is well advanced already, and that the remaining work is sandboxing. Is that correct?
Assuming this is correct, was there any external repository for the development of that shim code so far, or was this all developed internally? If it was developed internally, was it in its own repository, or are there repository dependencies to Kata?
When you talk about "passing the same test cases", do you mean you can run the existing Kata CI, or some other series of tests?
If there is a large body of code already, how do you suggest we should review it? It makes sense that the community would need to take time to review it. As @fidencio remarked during the meeting, all developers don't have equivalent familiarity with Rust and Go

Thanks for attempting such major work.

zmlcc commented 2 years ago

@c3d I am @gnawux's colleague working on rust shim. Hope I can explain more details for your comments

@gnawux Here are a few ideas and comments

The description seems to indicate that the shim reimplementation is well advanced already, and that the remaining work is sandboxing. Is that correct?

No. We have implemented rust shim with build-in sandboxing internally, which has been used in production envorinment for quite a long time. I think our internal codes could be considered mature, but some refactor is needed to support a pluggable architecture to support other vmms

Assuming this is correct, was there any external repository for the development of that shim code so far, or was this all developed internally? If it was developed internally, was it in its own repository, or are there repository dependencies to Kata?

So far all developed internally. We followed the logic of the existing Kata but totally rewirted in rust. The agent part inherits most of Kata Agent codes

When you talk about "passing the same test cases", do you mean you can run the existing Kata CI, or some other series of tests?

Mostly based on our internal CI and fully tested in production environment

If there is a large body of code already, how do you suggest we should review it? It makes sense that the community would need to take time to review it. As @fidencio remarked during the meeting, all developers don't have equivalent familiarity with Rust and Go

I think we should gradually release out the rust code piece by piece, with detailed documents and carful reviews. The open source process is also a good time for us to refactor and clean up. We do need and welcome help from the community

Thanks for attempting such major work.

gnawux commented 2 years ago

Thanks @zmlcc for the explanation.

@c3d I agree a patch containing massive codes is nightmare for reviewers and I can still recall that it took us over 3 months to landing kata-shim-v2 code last time. What I suggest are:

Smaller PRs are better ones. The team will split them into PRs.
Pass the go-shim CIs to help us confirm it has the right CRI and other Interface behaviors.

For language, I personally found that almost all the system devs are interesting in Rust 😸. In the initial Rust-shim developing days, d@lifupan @bergwolf and others are working on the Go code as well, and we commit to continue working on the Go-shim maintaining as long as they are still been supported by kata community.

ariel-adam commented 2 years ago

First, I agree with Fabiano that we need to separate the shim-v2 feature from the sandboxing feature and I'd recommend to start with the shim-v2 only feature. This could help Ali avoid any dependencies they have on the sandboxing feature parts when they start pushing the code to review.

Secondly, regarding the actual shim-v2 code. I believe that Samuel and and Peng-Tao proposed at the time creating a separate repository to push in the shim-v2 code and use it to review the changes which makes sense to me (as opposed to creating a new branch). I think that in practice it will be hard to break the shim-v2 code into smaller PRs that actually compile and run so we could consider the following approach:

Pushing the Ali code to a new repository - this code will probably need work to ensure it's generic enough and working with the kata upstream (for example no dependencies on the sandboxing pat)
Define a duration of time for people in the community to review and get engaged in this code (I'd say 3-4 month)
This will also give people in the community time to gradually learn rust and to feel comfortable with those changes
Once that is done pushing the changes in however making the shim selection configurable and the default remains the go shim. This should also be a gradual process to ensure we don't break anything
As part of this process adding additional jobs to the CI (with maintainers) focusing on the rust shim however the exiting CI jobs will continue running on the go shim ensuring nothing breaks
These part should be completed by Berlin
Laster on, working together to build a migration process from workloads using the go shim to the rust shim and gradually shifting the development on new features only to the rust shim (obviously many open questions such has handling CVEs)
At some point switching the default shim from go to rust while ensuring we support the go shim for at least XXX month
We could try the DPDK approach of 2 year LTS releases as part of this effort: https://doc.dpdk.org/guides/contributing/stable.html

My 2 cents.

zmlcc commented 2 years ago

@ariel-adam It would be very nice to create a new empty repository to hold rust code. Actually, we are trying to refactor our internal code into small pieces for ease review and clean. I think we can push these pieces to the new repository by a series of PRs in the following steps:

shim crate. It will only contain the codes to implement containerd runtime v2 api, but no functionalities about containers
hypervisor crate. It will include the codes to interaction with vms. At first version, only some essential capabilities are involved and may only qemu is supported.
device crate. It will describe the devices attached to hypervisors.
agent crate. May existing kata agent's codes would be reused with minor modifications
virtcontainers crate. This is the key crate to connect all other crates and implement the core logic. Only After this crate is released, the whole program could actually compile together and run. At first version, only basic functions are supported. At this stage the process mode is consistent with the go version: one shim process and one vm process for one pod.
"dragonball" crate. It will be a minimal build-in sandboxing implement that could work well with other parts. At this stage we can merge shim and vm into one process to support one pod.

The above code release process would take several months. I think the community will have enough time to familiarize and review. If all goes well, then we will have the base repository that implement a generic shim in rust and optional built-in sandboxing. On this basis advanced development can be carried out, such as CI, more hypervisors support and richer features as the existing go version, etc.

I agree the switching from go to rust shim would take years and need lots of work to make sure nothing breaks and the new shim performs better. We will have to maintain two versions together for quite a long time, but we can gradually shift our major focus on the rust shim when it becomes mature

By the way, If we decide to switch to from go to rust in the furture, shall we directly use the new rust repository as the main repo? Or merge the rust repository back to current main repo?

Bevisy commented 2 years ago

@zmlcc By reading the above carefully, I think this is a significant and meaningful thing, and I am very grateful for Alibaba's action here. However, I would like to know if we have specific data to compare the resource consumption and performance of the rust and go versions of shim based on Alibaba's internal implementation, which might be more intuitive to illustrate the essential of the rust version.

bergwolf commented 2 years ago

So based on the above discussions, the immediate action for us is to create a separate repository to hold the new shim code so that it can be properly reviewed/refactored to meet the community requirement. Can we agree on this? @kata-containers/architecture-committee @ariel-adam @zmlcc @gnawux

We should also decide on a roadmap for the feature to get mature and be merged into the kata-containers repository, which can be based on @ariel-adam's comment and include the following things:

The pace to push new shim code to the new repository
A plan for additional refactoring and feature addition
CI integration
Packaging integration
Specific milestones for each step with the target to release before Berlin Summit next year

ariel-adam commented 2 years ago

The plan makes sense to me :-).

I think that since we are talking about a potential disruptive change then by creating a separate repository those developing the rust shim will be able to provide visibility to all the community without breaking anything immediate (and still avoiding the kata1.x/kata2.x branch issues such as duplicated CI we had in the past).

There is also the topic of getting developers more familiar with rust which will take time and this approach I believe gives our community that time.

fidencio commented 2 years ago

So based on the above discussions, the immediate action for us is to create a separate repository to hold the new shim code so that it can be properly reviewed/refactored to meet the community requirement. Can we agree on this?

Not yet, Tao. Let's think about the work it'll cause, and then we can have an agreement after that.

If we do this in a different repo, what does it mean in the future?

We'll get all the content from that repo and submit as PRs to the kata-containers repo?
We'll move on to the new repo as we did during 1.x era?

If we do 2, that will be a pain. We still have users going to wrong repos, regardless the fact they're marked as EOL. If we do 1. we will have to spend a reasonable amount of time setting up CIs for the new repo to ensure it has exactly the same functionality as the current one, which would be easier to do when dealing with a branch.

While on this, I sincerely fail to understand this comment here, @ariel-adam.

and still avoiding the kata1.x/kata2.x branch issues such as duplicated CI we had in the past

What happened during 1.x -> 2.x transition era was different repos, which is exactly what was suggested here. With a new branch approach it'd be easier to actually avoid CI duplication (with tweaks in the install / build scripts), and also to keep the visibility of the work coming in.

Still ... the most important thing here that I don't see being discussed is a maintenance plan for the go-shim, who will step up to maintain that, and for how long. I really want to see a list of the names / companies that will, for sure, help with that, and what are the responsibilities each company will own from now on.

Another thing I'd like to raise while having the discussion on what and how things will be merged, is having a solid plan on who will review this code. I really would like to suggest that each PR has at least one reviewer from Apple, IBM, Intel, and Red Hat. While this may slow down the process, this will ensure that anything going on will not cause a breakage for the players we have here. I'd go further and ask other companies to chime in and add reviewers to ensure their use case is and will keep working, that the work they're doing now will be reflected on the rust shim, and so on, and so forth Again, we need a serious commitment to do so, but I think it's doable, but we need a serious commitment with the companies involved with the project.

fidencio commented 2 years ago

So, just to sum up everything I wrote, I'd like to suggest:

we have a go-shim maintenance group, and we decide now which players will commit to that
we have a rust-shim reviewers group. and we decide now which players will commit to that
we consider using a branch rather than using a repo, as using a branch would be less CI work

bergwolf commented 2 years ago

Thanks, @fidencio! All your concerns are valid to me. As a starter, we at Ant Group are willing to sign up for the go-shim maintenance and rust-shim reviewers groups. And I'd recommend all key kata players to sign up too.

As for a new branch vs. a new repo, there were concerns about disrupting the kata-containers repository. But I agree with you that a new branch really can't break anything in the main branch. And given the fact that we'd be able to set up CI more easily, I agree with you that a new branch works better than a new repo. We are doing the same for the CCv0 branch and it works pretty well IMHO.

gnawux commented 2 years ago

@fidencio The summary is really appreciated.

c3d commented 2 years ago

@fidencio @gnawux After discussing with @ariel-adam, I still think that there would be value in having a separate repo for review purpose, but not a blank one. Instead, that could be a clone of the kata-containers repo (for example, it could be hosted on https://github.com/alibaba/kata-containers, forked from https://github.com/kata-containers/kata-containers.

The benefits I see to this approach are:

We would have one place to review Alibaba's code (not N places for individual Alibaba developers)
It would make it clear that this is still intended to be merged back into kata-containers/kata-containers ultimately
It would make the proposed repository internal structure quite clear (e.g. where they would add the Rust shim code)
Alibaba could leverage the existing tools, e.g. CI, without any disruption for the kata-containers team
Because of the shared history, rebasing on main on a regular basis would be straightforward (minimal conflicts expected since this would essentially be new directories and new files)
Anything that is immediately useful to Kata 2.x (say, adding some capabilities to the CI, or Rust agent fixes) could be submitted to kata-containers along the way
Alibaba would make it clear that it's their contribution (they deserve it).

Once we are happy with the state of the "Alibaba repo", we call it a release and merge it into the main kata-containers repo.

The alternative is to use a development branch on the main repository, like CCv0. That works as well, but may cause more "noise" initially for reviews. I already find it to be the case for CCv0 to some extent, despite our best efforts to identify that work clearly.

fidencio commented 2 years ago

@c3d, I don't fully understand your points, let me go through each one of those and raise questions.

We would have one place to review Alibaba's code (not N places for individual Alibaba developers)

What do you mean by "not N places for individual Alibaba developer"? And how would this be different than using a branch?

It would make it clear that this is still intended to be merged back into kata-containers/kata-containers ultimately

How would this be different than with a branch?

It would make the proposed repository internal structure quite clear (e.g. where they would add the Rust shim code)

I don't follow. When a PR is open you can easily see the internal structure, and this shouldn't be different whether one uses a repo or a branch.

Alibaba could leverage the existing tools, e.g. CI, without any disruption for the kata-containers team

So, this is an important point. Are you suggesting that CI should be entirely worked by Ant / Ali / Hyper folks? What's the difference on doing this with the branch and with the repo, according to your understanding?

Because of the shared history, rebasing on main on a regular basis would be straightforward (minimal conflicts expected since this would essentially be new directories and new files)

I don't see why it'd be different than what we'd get when using a branch, could you elaborate more on this?

Anything that is immediately useful to Kata 2.x (say, adding some capabilities to the CI, or Rust agent fixes) could be submitted to kata-containers along the way

Again, I don't see why it'd be different than what we'd get when using a branch, could you elaborate more on this?

Alibaba would make it clear that it's their contribution (they deserve it).

I really don't follow you, sorry. How using a branch would make it any different?

The alternative is to use a development branch on the main repository, like CCv0. That works as well, but may cause more "noise" initially for reviews. I already find it to be the case for CCv0 to some extent, despite our best efforts to identify that work clearly.

The "noise" is work being generated that we all should be committed to review. Having the "noise" generated in a different place, IMHO, may actually foggy the visibility of this effort.

All in all, I am really sorry, but the points you raised are not clear to me, at all, and going to this direction would cause an extra load at least on the CI side, as mentioned on https://github.com/kata-containers/kata-containers/issues/3052#issuecomment-978953449

bergwolf commented 2 years ago

@c3d

The alternative is to use a development branch on the main repository, like CCv0. That works as well, but may cause more "noise" initially for reviews. I already find it to be the case for CCv0 to some extent, despite our best efforts to identify that work clearly.

To add up to @fidencio's comments, we actually want to make such "noises", to get community attention and to get people familiar with the code Since it is a big feature for kata 3.0, we don't intend to hide it in any way. So I'm sorry if it bothers you, but it is the plan, not something we want to avoid.

c3d commented 2 years ago

We would have one place to review Alibaba's code (not N places for individual Alibaba developers)

What do you mean by "not N places for individual Alibaba developer"? And how would this be different than using a branch?

It would not make much of a difference for one branch. But since there are multiple developments going on, (Rust shim, sandboxing, rune, etc), I expect multiple topic branches to naturally emerge, not just one. So I'm looking for a way to group many related branches. The alternative would be a naming convention for branches, i.e. we could have v3/agent, v3/runtime, v3/shim and so on.

It would make it clear that this is still intended to be merged back into kata-containers/kata-containers ultimately

How would this be different than with a branch?

That point was primarily in contrast to a blank repo, not to a branch.

It would make the proposed repository internal structure quite clear (e.g. where they would add the Rust shim code)

I don't follow. When a PR is open you can easily see the internal structure, and this shouldn't be different whether one uses a repo or a branch.

Again, it would if there were multiple topic branches converging. In one case, all the related discussion would be intermixed with, say, the kata 2.5 release. In the other case, the split would be easier, helping people focusing on one task or the other.

Alibaba could leverage the existing tools, e.g. CI, without any disruption for the kata-containers team

So, this is an important point. Are you suggesting that CI should be entirely worked by Ant / Ali / Hyper folks? What's the difference on doing this with the branch and with the repo, according to your understanding?

For example, Alibaba could decide that in their repo, the set of green CI is a bit lower, at least initially, to reduce resource usage. This is just one example.

Because of the shared history, rebasing on main on a regular basis would be straightforward (minimal conflicts expected since this would essentially be new directories and new files)

I don't see why it'd be different than what we'd get when using a branch, could you elaborate more on this?

The point here is that it's not made harder by using a different repo. But there are also useful use cases. For example, say we introduce a change in v2 that breaks all the v3 branches. They only need one adjustment patch on "their" main that engineers working on v3 can rebase all the rest on.

Anything that is immediately useful to Kata 2.x (say, adding some capabilities to the CI, or Rust agent fixes) could be submitted to kata-containers along the way

Again, I don't see why it'd be different than what we'd get when using a branch, could you elaborate more on this?

It's really a matter of one branch versus many. I do not believe that Alibaba's effort would reasonably fit in a single branch. So I'm trying to find an organization letting us more easily create a "funnel" of related branches that ultimately converge towards v3. Of course, it can all technically be done in the same repo as the 2.x development, but it will make it much less clear what is 2.x and what is 3.x.

Just to give a single example related to that specific point: in the v3 repo, the meaning of "needs-backport" could mean "backporting to main". whereas in the v2 repo, it would mean "backporting to stable".

Alibaba would make it clear that it's their contribution (they deserve it).

I really don't follow you, sorry. How using a branch would make it any different?

There is a bit more visibility in having an alibaba repo with a lot of churn and a lot of commits by Alibaba. It's really secondary, but again, I find it weird that they would have an alibaba org on GitHub and not take advantage of it to showcase such a massive effort. Now, that's just me, I am not working for Alibaba and I don't have to tell them how to polish their brand ;-)

The alternative is to use a development branch on the main repository, like CCv0. That works as well, but may cause more "noise" initially for reviews. I already find it to be the case for CCv0 to some extent, despite our best efforts to identify that work clearly.

The "noise" is work being generated that we all should be committed to review. Having the "noise" generated in a different place, IMHO, may actually foggy the visibility of this effort.

To respond to your comment and @bergwolf at the same time: the noise in question relates to reviews of code that is not mature vs. code that is mature and needs to go in quickly for releases. Many developers filter patches differently based on whether they apply to the current release or whether it's a long-term review where many iterations are needed.

So by "noise', I'm not talking about "hype", but about ongoing disruption of the primary release tool for the project. I am trying to reconcile two objectives:

I want to see the code and be able to comment on it, with the assumption that it's still somewhat immature.
I don't want that to interfere with (introduce noise in ) the mature code that we need to maintain and keep stable.

All in all, I am really sorry, but the points you raised are not clear to me, at all, and going to this direction would cause an extra load at least on the CI side, as mentioned on #3052 (comment)

The point in that comment is totally irrelevant, since it applies to different repos with no shared history. I am specifically suggesting a cloned repo, which IMO renders your line of argumentation moot. Please clarify if you think I missed something in your CI-related comment.

Quite to the contrary, I believe that some specific CI adjustments are hard to do on a per-branch basis. The baseline to use for v2 vs v3 is an example.

Bottom line, to me it really boils down to the following belief: I do not believe that reviewing the v3 code can or should be done with a single branch, and having many v3 branches in the primary repo will require a lot of conventions.

If we go with the single repo rout, may I suggest we need to discuss the following conventions:

All branches related to the v3 effort should be prefixed with v3/ (just the same, CCv0 could be renamed cc/v0).
We should have a v3/main which would be a regular rebase of main with whatever additional fixes are needed
We should understand exactly how we can have a more lenient baseline for v3/ branches
We should agree on what backport means for v3/ branches (to main or to stable- ?)
We should agree on how to label something that needs to be forward-ported to v3 (or if we even need to, given that regular rebasing would make this useless)
We should agree on systematic tagging of v3 issues and PRs
We need a breaksV3 label or something, and we need to separate the api-breakage into "expected" (for V3) and "unexpected" (for V2 bugs).

So yes, all this can be done before we start reviewing Alibaba's code. I frankly feel that a repository clone to get started is a tad bit simpler :-)

bergwolf commented 2 years ago

It would not make much of a difference for one branch. But since there are multiple developments going on, (Rust shim, sandboxing, rune, etc), I expect multiple topic branches to naturally emerge, not just one. So I'm looking for a way to group many related branches. The alternative would be a naming convention for branches, i.e. we could have v3/agent, v3/runtime, v3/shim and so on.

Now I see why you had these thoughts. IMO what you described (as "multiple topic branches") are pull requests rather than branches. And I think we can clarify that we only plan to maintain a single branch for the rust shim + sandboxing code. And there is no rune (not sure if you mean the one that Alibaba Cloud uses for enclave containers, but it is certainly out of the scope).

So by "noise', I'm not talking about "hype", but about ongoing disruption of the primary release tool for the project. I am trying to reconcile two objectives: I want to see the code and be able to comment on it, with the assumption that it's still somewhat immature. I don't want that to interfere with (introduce noise in ) the mature code that we need to maintain and keep stable.

The code is being developed in a separate branch and will not be merged into the main branch until it is mature enough. I don't see why it would disrupt the primary release tool for the project, which is supposed to live in the main branch. So I'm not sure what is the interfere with the mature code part about.

Bottom line, to me it really boils down to the following belief: I do not believe that reviewing the v3 code can or should be done with a single branch, and having many v3 branches in the primary repo will require a lot of conventions.

We have a single main branch and so many ongoing features for kata in different pull requests. None of them is asking for new branches. Maintaining a single branch is actually easier than having multiple branches. With multiple branches, we'll have to spend more time syncing things between them and setting up CI for them. I don't think we should go for it.

fidencio commented 2 years ago

Let's take a step back here and try to get everyone on the same page and understanding of how CI works, as I think it's not clear, and as it's essential for this discussion to be productive.

When we talk about CIs, we're talking about 2 things here:

GitHub actions
Jenkins CI

GitHub actions: Those are the ones which are located here, and the currently tests running are:
- Commit Message Check
- Ensure PR has required porting labels
- Pull request WIP checks
- Static checks (x2)
- 1.16.x
- 1.17.x
- kata deploy build (x10)
- kernel
- kernel-experimental
- shim-v2
- qemu
- cloud-hypervisor
- firecracker
- rootfs-image
- rootfs-initrd
- make-kata-tarball
- create-kata-tarball
- snap CI
These actions are inherited by the forks and depending on what secrets are used on each one of those, they may or may not work correctly on the forked repo.
Jenkins CI Those are ones which are located here and you can follow their execution here, and the currently jobs running are:
- jenkins-ci-ARM-ubuntu-18-04
- jenkins-ci-centos-8
- jenkins-ci-fedora-32-crio
- jenkins-ci-fedora-32-crio-k8s-e2e-minimal
- jenkins-ci-firecracker-ubuntu-18-04
- jenkins-ci-pmem-ubuntu-18-04
- jenkins-ci-qat-ubuntu-18-04
- jenkins-ci-s390x-ubuntu-20-04
- jenkins-ci-ubuntu-1804-cloud-hypervisor-k8s-containerd-2.0
- jenkins-ci-20-04
- jenkins-ci-ubuntu-clh-crio-18-04
- jenkins-metrics-ubuntu-20-04
- jenkins-vfio-host:ubuntu18-04-vm:fedora32
Those jobs are tied to a repo (in this case, kata-containers), and to an organization (in our case kata-containers), and we have the freedom to decide on which branches they run based on the configuration of each job itself. For instance, let's take a look at the jenkins-ci-ARM-ubuntu-18-04, there you have an allow-list saying that we run that job on main and stable-2.* branches, and a deny list saying that we don't run that job on master and stable-1.* branches.

Adding a new repo inside the same organization, would require adding and maintaining the new job, but adding a new branch would require tweaking the allow list. Adding those to a new organization is something that I do not know how much work would be, but I assume more than adding a new repo, as we might have to configure jenkins, we might have to configure new groups, new permissions, etc. But adding a new branch would mean adding a new field in the "allow list" mentioned above.

Everyone, please, have this in mind when discussing the effort needed when we think about the CI.

fidencio commented 2 years ago

Now, with everything mentioned as part of https://github.com/kata-containers/kata-containers/issues/3052#issuecomment-979177240, I think we should target having both set of jobs running. But if we had to choose one to keep, I'd keep testing against the Jenkins ones, as those are the baseline for the Green CI effort and, although not perfect, the most complete thing we have.

bergwolf commented 2 years ago

I think we should keep both. Although at first, the new code may not be able to pass either of them, we should target having both of them working when merging the code back to the main branch.

c3d commented 2 years ago

Now I see why you had these thoughts. IMO what you described (as "multiple topic branches") are pull requests rather than branches. And I think we can clarify that we only plan to maintain a single branch for the rust shim + sandboxing code.

I do distinguish topic branches from PRs.

A topic branch is something that is not mergeable yet, but where you invite review on a number of related commits, or even invite contributions from other developers to complete the work. It can be quite long-lived. CCv0 is our prime example today.

A PR is something that is ready to be merged. We sometimes use a "WIP PR" to advertise a topic branch. But you can have a PR to a topic branch, we do that all the time for CCv0. A PR is not generally intended to be long-lived.

And there is no rune (not sure if you mean the one that Alibaba Cloud uses for enclave containers, but it is certainly out of the scope).

Sorry, I meant rund. Too many run[a-z] variants these days (runc, rund, rune, runk, etc). Though my only knowledge about rund (like rune) is mostly its appearance on a slide. Or is there no rund either?

fidencio commented 2 years ago

The point in that comment is totally irrelevant, since it applies to different repos with no shared history. I am specifically suggesting a cloned repo, which IMO renders your line of argumentation moot. Please clarify if you think I missed something in your CI-related comment.

@c3d, please, see: https://github.com/kata-containers/kata-containers/issues/3052#issuecomment-979177240

c3d commented 2 years ago

@fidencio I see why we reach different conclusions now. You are saying that it's easy to leverage existing jobs, and I agree with that. However, to my earlier question on the topic, @zmlcc replied:

Mostly based on our internal CI and fully tested in production environment

So my understanding is that the code as it exists today would not pass the existing Jenkins jobs. Or maybe it would, but that would not test the new pieces.

@zmlcc @bergwolf : this leads to three questions:

Do you think that your existing CI for the new code is something that you could share publicly (e.g. could it be turned into Jenkins jobs running on our infrastructure)?
Do you think that there are reasonable chances that the existing code could pass our current Jenkins CI (most likely testing only the existing Go components which would be largely untouched)?
Do you think that there are reasonable chance that the existing code could pass a large enough fraction of our existing CI using the Rust shim as a "drop-in" replacement (i.e. running the old tests with the new shim) ?

bergwolf commented 2 years ago

A topic branch is something that is not mergeable yet, but where you invite review on a number of related commits, or even invite contributions from other developers to complete the work. It can be quite long-lived. CCv0 is our prime example today.

Ok, in that case, the rust shim stays in a single branch just like CCv0.

Sorry, I meant rund. Too many run[a-z] variants these days (runc, rund, rune, runk, etc). Though my only knowledge about rund (like rune) is mostly its appearance on a slide. Or is there no rund either?

rund is just the combination of rust shim and builtin sandbox.

bergwolf commented 2 years ago

So my understanding is that the code as it exists today would not pass the existing Jenkins jobs. Or maybe it would, but that would not test the new pieces.

We need to see what passes and what fails and fix up the failure part.

As for your three questions, I'll let @zmlcc answer them. But IMO the internal CIs should be an addition to the existing Jenkins job, not a replacement, if it can be run in a public environment.

c3d commented 2 years ago

A topic branch is something that is not mergeable yet, but where you invite review on a number of related commits, or even invite contributions from other developers to complete the work. It can be quite long-lived. CCv0 is our prime example today.

Ok, in that case, the rust shim stays in a single branch just like CCv0.

What about work on extensions, image acceleration or plugins? Would that all be lumped into the same big branch (let's call it v3 for clarity)?

Sorry, I meant rund. Too many run[a-z] variants these days (runc, rund, rune, runk, etc). Though my only knowledge about rund (like rune) is mostly its appearance on a slide. Or is there no rund either?

rund is just the combination of rust shim and builtin sandbox.

That's more or less what I understood. So how do you plan to present that to the community?

To me, there are two really different needs in terms of review:

Reviewing code that you already put in production, but never made public. For this, we need to see the whole thing (therefore a branch), so that we can collectively decide how to improve the code before making calling that v3 (that would be small PRs that fix things, not merge them into main but into v3).
Developing new features that we feel belong to v3 but do not exist yet. Obviously, you cannot share a big branch for those, but I think it would make things really difficult for the rest of the community if these other efforts were not clearly delineated and organized. So I would expect topic branches like v3/image-acceleration where PRs would merge features, not fixes or review comments.

The way I see it, you are suggesting the two at the same time, and to me (at least in my brain), this conflicts.

bergwolf commented 2 years ago

What about work on extensions, image acceleration or plugins? Would that all be lumped into the same big branch (let's call it v3 for clarity)?

Those are nice to have but not the initial target for 3.0. And they should go in as PRs if they ever have a chance to be part of 3.0.

Reviewing code that you already put in production, but never made public. For this, we need to see the whole thing (therefore a branch), so that we can collectively decide how to improve the code before making calling that v3 (that would be small PRs that fix things, not merge them into main but into v3).

This is the main part.

Developing new features that we feel belong to v3 but do not exist yet. Obviously, you cannot share a big branch for those, but I think it would make things really difficult for the rest of the community if these other efforts were not clearly delineated and organized. So I would expect topic branches like v3/image-acceleration where PRs would merge features, not fixes or review comments.

The image-acceleration part in Gerry's presentation is mostly for CC. So it may just target at CCv0 first, and it is actually beyond the scope of the proposal here (rust shim + builtin sandbox).

jiangliu commented 2 years ago

@gnawux Here are a few ideas and comments

The description seems to indicate that the shim reimplementation is well advanced already, and that the remaining work is sandboxing. Is that correct?

Assuming this is correct, was there any external repository for the development of that shim code so far, or was this all developed internally? If it was developed internally, was it in its own repository, or are there repository dependencies to Kata? Originally it's developed internally and is independent of Kata. But we are making plans to refactor the code for upstreaming.

When you talk about "passing the same test cases", do you mean you can run the existing Kata CI, or some other series of tests?

If there is a large body of code already, how do you suggest we should review it? It makes sense that the community would need to take time to review it. As @fidencio remarked during the meeting, all developers don't have equivalent familiarity with Rust and Go We are trying to split the code into pieces, so it could be easily reviewed and merged by upstream:)

Thanks for attempting such major work.

jiangliu commented 2 years ago

@fidencio I see why we reach different conclusions now. You are saying that it's easy to leverage existing jobs, and I agree with that. However, to my earlier question on the topic, @zmlcc replied:

Mostly based on our internal CI and fully tested in production environment

So my understanding is that the code as it exists today would not pass the existing Jenkins jobs. Or maybe it would, but that would not test the new pieces.

@zmlcc @bergwolf : this leads to three questions:

Do you think that your existing CI for the new code is something that you could share publicly (e.g. could it be turned into Jenkins jobs running on our infrastructure)?

The Kata CI and our CI have many common parts, but also have some different interests too. So it definitely helps to share the common parts. We are glad to contribute CI test cases to the community, and it also be great to run the community CI system in our QA systems too if it's feasible.

Do you think that there are reasonable chances that the existing code could pass our current Jenkins CI (most likely testing only the existing Go components which would be largely untouched)?

Do you think that there are reasonable chance that the existing code could pass a large enough fraction of our existing CI using the Rust shim as a "drop-in" replacement (i.e. running the old tests with the new shim) ?

Need more investigation to figure out the status.

zmlcc commented 2 years ago

@fidencio I see why we reach different conclusions now. You are saying that it's easy to leverage existing jobs, and I agree with that. However, to my earlier question on the topic, @zmlcc replied:

Mostly based on our internal CI and fully tested in production environment

So my understanding is that the code as it exists today would not pass the existing Jenkins jobs. Or maybe it would, but that would not test the new pieces.

@zmlcc @bergwolf : this leads to three questions:

Do you think that your existing CI for the new code is something that you could share publicly (e.g. could it be turned into Jenkins jobs running on our infrastructure)?

Surely Alibaba can share the internal CI, just like sharing the rust shim code. But in my opinion, we want to use the existing Kata CI to ensure that the new code is functionally consistent with the existing one. Ali's internal CI would not help this.

Do you think that there are reasonable chances that the existing code could pass our current Jenkins CI (most likely testing only the existing Go components which would be largely untouched)?

Do you think that there are reasonable chance that the existing code could pass a large enough fraction of our existing CI using the Rust shim as a "drop-in" replacement (i.e. running the old tests with the new shim) ?

No. Firstly our internal code development is independent of Kata, so we connot guarantee Kata's CI fits our internal code at all and vice versa. Secondly, we want to gradually release out the rust code. In the initial stage only some basic functionalities are included, which certainly cannot pass all the full-featured Kata CI.

Actually I suggest using seperate CI for the new code. As the new code development, we can gradually add test cases into the seperate CI until the new CI is consistant with the existing CI. At that time, we can say the new code is mature enough to merge into the main repo/branch

mxpv commented 2 years ago

Apologies for being a bit late for the party. We've (containerd) just open sourced rust extensions for containerd that include crates for ttrpc/grpc clients, protos, shim, etc. Is this something you'd be potentially interested to consolidate efforts and maintain one crate?

liubin commented 2 years ago

It's an honor that we implemented this feature and it's up to me to close this issue.