google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Support cgroup v2 in runsc #3481

Closed fvoznika closed 2 years ago

fvoznika commented 4 years ago

runsc uses cgroups V1 to set pod limits. Kubernetes is switching over to cgroups V2: it's alpha in 1.19 and will possibly hit beta in 1.20.

Relevant links: SIG-node cgroups KEP, containerd issue, runc issue

majek commented 4 years ago

Hi, we are slowly thinking about cgroups v2, it would be nice to know if this is on the roadmap.

fvoznika commented 4 years ago

This work is not staffed right now. We're planning to pick this up early next year.

dqminh commented 3 years ago

@fvoznika has there been any progress on this issue? I'm planning to spend some time working on this if possible (we are planning to migrate to cgroupv2 very soon), so I wonder whether we should wait or start a collaborative effort on this.

fvoznika commented 3 years ago

No progress yet. It would be great if you could get started on it.

dqminh commented 3 years ago

Looking at this, I think my plan is roughly:

fvoznika commented 3 years ago

Thanks for spelling out your plan. We try to avoid adding dependencies as much as possible to have tight control over the code that is included in runsc. See Security principles for more details.

So instead of replacing runsc/cgroup, it could be extended to support cgroups v2. The exported functions in cgroup.Cgroup can move to an interface that has distinct implementations for v1 and v2. Something like this:

type Cgroup interface {
  Install(res *specs.LinuxResources) error
  Uninstall() error
  Join() (func(), error)
  CPUQuota() (float64, error)
  NumCPU() (int, error)
  MemoryLimit() (uint64, error)
}
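To make the suggestion concrete, here is a minimal sketch of one interface with separate v1 and v2 implementations behind a constructor. It is not the runsc code: `Resources`, `fakeCgroup`, `cgroupV1`, `cgroupV2` and `New` are hypothetical names, the interface is trimmed to three methods, and state is kept in memory rather than written to /sys/fs/cgroup. The only real-world detail is the control file each version would use for the memory limit (memory.limit_in_bytes on v1, memory.max on v2).

```go
package main

import (
	"errors"
	"fmt"
)

// Resources is a hypothetical, trimmed stand-in for specs.LinuxResources.
type Resources struct {
	MemoryLimit uint64 // bytes
}

// Cgroup is a reduced version of the interface proposed above.
type Cgroup interface {
	Install(res *Resources) error
	Uninstall() error
	MemoryLimit() (uint64, error)
}

var errNotInstalled = errors.New("cgroup not installed")

// fakeCgroup keeps state in memory instead of touching /sys/fs/cgroup.
type fakeCgroup struct {
	limitFile string // control file a real implementation would write
	limit     uint64
	installed bool
}

func (c *fakeCgroup) Install(res *Resources) error {
	c.limit, c.installed = res.MemoryLimit, true
	return nil
}

func (c *fakeCgroup) Uninstall() error {
	c.installed = false
	return nil
}

func (c *fakeCgroup) MemoryLimit() (uint64, error) {
	if !c.installed {
		return 0, errNotInstalled
	}
	return c.limit, nil
}

// In this sketch the two versions differ only in the file they target.
type cgroupV1 struct{ fakeCgroup }
type cgroupV2 struct{ fakeCgroup }

// New picks an implementation; unified reports whether the host runs cgroup v2.
func New(unified bool) Cgroup {
	if unified {
		return &cgroupV2{fakeCgroup{limitFile: "memory.max"}}
	}
	return &cgroupV1{fakeCgroup{limitFile: "memory.limit_in_bytes"}}
}

func main() {
	cg := New(true)
	if err := cg.Install(&Resources{MemoryLimit: 128 << 20}); err != nil {
		fmt.Println("install failed:", err)
		return
	}
	limit, _ := cg.MemoryLimit()
	fmt.Printf("%T limit=%d\n", cg, limit)
}
```

Callers only ever hold a `Cgroup`, so the v1 code path can stay untouched while the v2 implementation is developed and reviewed separately.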

Re: testing, that's a good question. We have a cgroups integration test in root/cgroup_test.go. We can make sure that the images used to run this test support cgroups v2; otherwise, nested virtualization is also an option.

dqminh commented 3 years ago

@fvoznika some updates. First of all, it's working:

[vagrant@localhost vagrant]$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
[vagrant@localhost vagrant]$ docker run --cpu-shares 4096 --memory 128m -it --runtime runsc hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

[vagrant@localhost vagrant]$ docker run --cpu-shares 4096 --memory 128m -it --runtime runsc debian bash
root@fdf96bc8e8dc:/#

I did abandon the requirement to change the cgroupv1 code to reuse libcontainer's cgroup interface, since matching the feature set 1:1 while preserving backward compatibility is quite complicated. We use libcontainer's cgroup interface only for v2 and switch between the two implementations depending on v2 detection. The current interface is:

type Cgroup interface {
  Install(name string, res *specs.LinuxResources) error
  Uninstall(name string) error
  Join(name string) (func(), error)
  CPUQuota(name string) (float64, error)
  NumCPU(name string) (int, error)
  MemoryLimit(name string) (uint64, error)
}

type cgroupV2Manager struct {
    manager libcontainercgroups.Manager
}

I'm passing name in so that the libcontainercgroups.Manager object can be reconstructed on each call if necessary. Then in the cgroup code we do:

if libcontainercgroups.IsCgroup2UnifiedMode() {
  // do v2
} else {
  // do v1 
}
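For readers wondering what such a detection check involves, the usual approach is to statfs the cgroup mount point and compare the filesystem type against CGROUP2_SUPER_MAGIC (0x63677270 in linux/magic.h). The sketch below does that with only the standard library on Linux; `isCgroup2` is a hypothetical helper name, not a function from runsc or libcontainer.

```go
package main

import (
	"fmt"
	"syscall"
)

// cgroup2SuperMagic is CGROUP2_SUPER_MAGIC from linux/magic.h.
const cgroup2SuperMagic = 0x63677270

// isCgroup2 reports whether path lives on a cgroup2 filesystem.
// Hypothetical helper for illustration; Linux only.
func isCgroup2(path string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == cgroup2SuperMagic, nil
}

func main() {
	unified, err := isCgroup2("/sys/fs/cgroup")
	if err != nil {
		fmt.Println("cannot stat /sys/fs/cgroup:", err)
		return
	}
	if unified {
		fmt.Println("cgroup v2 (unified hierarchy)")
	} else {
		fmt.Println("cgroup v1 or hybrid hierarchy")
	}
}
```

On the Vagrant box shown earlier, where /sys/fs/cgroup is mounted as cgroup2, this check would be expected to take the v2 branch.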

Now we are at the stage of figuring out how to pass most of the integration tests. I don't think the images will need any additional support; rather, the integration tests will need to be adjusted because not all v1 values map to v2. It looks like the CRI setup will need some changes too. I'm testing this inside a Vagrant VM, similar to how containerd/runc do it, so it can be moved to a CI system that supports nested virtualization.
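One example of a v1 value that has no direct v2 equivalent is the CPU weight: v1 cpu.shares ranges over [2, 262144], while v2 cpu.weight ranges over [1, 10000]. The sketch below uses the linear mapping that runc's libcontainer applied at the time; whether the patch discussed here uses exactly this conversion is an assumption, and `cpuSharesToWeight` is a hypothetical name. The clamping of out-of-range inputs is also added for this sketch.

```go
package main

import "fmt"

// cpuSharesToWeight maps a cgroup v1 cpu.shares value onto the cgroup v2
// cpu.weight range using a linear conversion. Zero means "unset" and is
// passed through; other out-of-range inputs are clamped (sketch only).
func cpuSharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	for _, s := range []uint64{2, 1024, 4096, 262144} {
		fmt.Printf("cpu.shares=%d -> cpu.weight=%d\n", s, cpuSharesToWeight(s))
	}
}
```

With this mapping, the `--cpu-shares 4096` used in the docker commands above becomes a cpu.weight of 157, so a test that reads back 4096 from the v1 cpu.shares file needs a different expectation on a v2 host.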

avagin commented 3 years ago

I've created the feature branch https://github.com/google/gvisor/tree/feature/cgroupv2.

Let's continue the cgroupv2 development there. Once it is ready, we will merge it into the master branch.

TODO list:

This list is based on @fvoznika comments for #5453 that have not been addressed.

dqminh commented 3 years ago

Hi again! Sorry for the period of inactivity, I was busy with some other projects.

@avagin https://github.com/google/gvisor/tree/feature/cgroupv2 is good, what's the development process here? I think maybe we can split the patchset into 2 parts: the first bumps dependencies and creates a cgroup interface that lets v1 and v2 coexist, and the second adds v2 support.

Run cgroup tests.

The PR uses Vagrant to set up a v2 environment. It would be great if someone with CI access could set that up, either with Vagrant or with a build agent that runs cgroupv2. I don't have CI access, so the feedback loop is terrible here.

Remove external dependencies.

I think ideally we want to have some shared libraries here that different cgroup consumers can use. Currently it uses the runc cgroupv2 implementation, but there's also a desire to unify that implementation with containerd/cgroups (see https://github.com/opencontainers/runc/issues/3007). Is that acceptable?

Bumping up containerd to 1.4 breaks compatibility with 1.3.

I will need to take another look at this to see if we can still keep 1.3 compatibility (maybe possible, but we would have to reimplement a bunch of things, iirc). The simplest option is of course to bump the required version of containerd to 1.4. Is there any plan to do that?

avagin commented 3 years ago

Hi again! Sorry for the period of inactivity, I was busy with some other projects.

@avagin https://github.com/google/gvisor/tree/feature/cgroupv2 is good, what's the development process here?

All new PRs for cgroupv2 should target this branch.

I think maybe we can split the patchset into 2 parts: the first bumps dependencies and creates a cgroup interface that lets v1 and v2 coexist, and the second adds v2 support.

It is up to you, but keep in mind that we want to avoid adding new external dependencies without a real reason. We can consider copying some code from runc; I think the license allows us to do this.

Run cgroup tests.

The PR uses Vagrant to set up a v2 environment. It would be great if someone with CI access could set that up, either with Vagrant or with a build agent that runs cgroupv2. I don't have CI access, so the feedback loop is terrible here.

I will help with this, but let's solve the other TODOs first.

Remove external dependencies.

I think ideally we want to have some shared libraries here that different cgroup consumers can use. Currently it uses the runc cgroupv2 implementation, but there's also a desire to unify that implementation with containerd/cgroups (see opencontainers/runc#3007). Is that acceptable?

It depends on a few things. The main idea is that we want to be able to review all the code that we use. This means that a new library should bring in a limited number of new external dependencies and has to be relatively small, with as little functionality that we will not use as possible.

dqminh commented 3 years ago

I'm repackaging the patchset to make reviewing and testing simpler:

  1. https://github.com/google/gvisor/pull/6485 bumps the containerd dependencies to 1.4 without any other changes. I think this still satisfies our requirements, i.e. it should work with the containerd 1.3 runtime. This should reduce some of the code that we need for the shim.
  2. https://github.com/google/gvisor/pull/6499 ports the existing v1 code to the new cgroup interface.
  3. The next step would be to write the cgroupv2 patch based on our past work. I'm rewriting the patch a little to remove the external dependency on libcontainer. Once 2) is merged we can rebase feature/cgroupv2 on that, and if @avagin can help add the test environment for cgroupv2 that would be great.

dqminh commented 3 years ago

@fvoznika @avagin I have repackaged the work into 2 PRs. We don't need to bump any extra dependencies now.

6499 adds the common cgroup interface for v1 and v2

6821 adds the cgroupv2 implementation

We need a cgroupv2 environment to run the tests. Can you help with that?

avagin commented 3 years ago

We need a cgroupv2 environment to run the tests. Can you help with that?

I will help with that. I am going to add cgroup2 workers in buildkite.

avagin commented 2 years ago

https://github.com/google/gvisor/pull/6884