@cyphar PTAL, thanks.
@lifubang does reverting those commits fix the issue?
does reverting those commits fix the issue?
Yes, it works very well after reverting #3931.
I can add the contrib program to bind-mount the memfd, which should eliminate this issue (see the discussion in the PR). I didn't add it because the primary concern last time was container memory usage, and memfd_create(2) no longer causes issues. The only downside is you need to have a small daemon running on the system.
I really do not want to have to keep bindfd...
I encountered a similar issue. When using the main branch, my test case failed. However, when I solely used the 'remove bindfd' commit, my test case passed. I suspect that this PR might be conflicting with other PRs on the main branch, leading to a new performance issue. I'd recommend not using the main branch and instead trying only @cyphar's PR, @lifubang.
Oh my, this is tough. Let's discuss further in https://github.com/opencontainers/runc/pull/3931
However, when I solely used the 'remove bindfd' commit, my test case passed.
Your suggestion can't fix my issue.
I suspect that this PR might be conflicting with other PRs on the main branch
I don't think that's the problem. It's obviously a design problem rather than a code bug; cyphar has confirmed it and already has a solution for this issue.
So maybe your issue is not the same as mine. You can discuss it further in #3931 or open a new issue once you know how to reproduce it clearly.
The node is in a stuck state. Are there any logs?
Maybe you could test with @cyphar's suggestion. I'll show you the memfd code this afternoon.
I will post a PR, but the code in question is https://github.com/opencontainers/runc/pull/3931#discussion_r1259665352. If you use it to mount on top of the runc binary, you should be able to run as many runc processes as you like without extra memory overhead. The downside is that the program needs to keep running as a daemon -- if the process dies, you cannot exec the mounted runc path (and package updates are more annoying -- you need to stop the daemon, upgrade the runc package, then restart the daemon).
The memfd code is:
package main
import (
"errors"
"fmt"
"io"
"os"
"os/signal"
"runtime"
"strings"
"time"
"github.com/urfave/cli"
"golang.org/x/sys/unix"
)
// version will be populated by the Makefile, read from
// VERSION file of the source code.
var version = ""
// gitCommit will be the hash that the binary was built from
// and will be populated by the Makefile
var gitCommit = ""
const (
usage = `Open Container Initiative contrib/cmd/memfd-bind
In order to protect against certain container attacks, every runc invocation
that involves creating or joining a container will cause runc to make a copy of
the runc binary in memory (usually to a memfd). While "runc init" is very
short-lived, this extra memory usage can cause problems for containers with
very small memory limits (or containers that have many "runc exec" invocations
applied to them at the same time).
memfd-bind is a tool to create a persistent memfd-sealed-copy of the runc binary,
which will cause runc to not make its own copy. This means you can get the
benefits of using a sealed memfd as runc's binary (even in a container breakout
attack to get write access to the runc binary, neither the underlying binary
nor the memfd copy can be changed).
To use memfd-bind, just specify the path of the runc binary you want to
protect:
$ sudo memfd-bind /usr/bin/runc
Note that (due to kernel restrictions on bind-mounts), this program must remain
running on the host in order for the binary to be readable (it is recommended
you use a systemd unit to keep this process around).
If this program dies, there will be a leftover mountpoint that always returns
-EINVAL when attempting to access it. You need to use memfd-bind --cleanup on the
path in order to unmount the path (regular umount(8) will not work):
$ sudo memfd-bind --cleanup /usr/bin/runc
Note that (due to restrictions on /proc/$pid/fd/$fd magic-link resolution),
only privileged users (specifically, those that have ptrace privileges over the
memfd-bind daemon) can access the memfd bind-mount. This means that using this
tool to harden your /usr/bin/runc binary would result in unprivileged users
being unable to execute the binary. If this is an issue, you could make all
privileged processes use a different copy of runc (by making a copy somewhere
like /usr/sbin/runc) and only use memfd-bind for the version used by
privileged users.
`
)
func cleanup(path string) error {
file, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("cleanup: failed to open runc binary path: %w", err)
}
defer file.Close()
fdPath := fmt.Sprintf("/proc/self/fd/%d", file.Fd())
// Keep umounting until we hit a umount error.
for unix.Unmount(fdPath, unix.MNT_DETACH) == nil {
// loop...
}
fmt.Fprintf(os.Stdout, "memfd-bind: path %q has been cleared of all old bind-mounts\n", path)
return nil
}
func memfdClone(path string) (*os.File, error) {
binFile, err := os.Open(path)
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to open runc binary path: %w", err)
}
defer binFile.Close()
const memfdComment = "runc_cloned:/proc/self/exe"
memfd, err := unix.MemfdCreate(memfdComment, unix.MFD_CLOEXEC|unix.MFD_EXEC|unix.MFD_ALLOW_SEALING)
if errors.Is(err, unix.EINVAL) {
// old kernel with no MFD_EXEC support
memfd, err = unix.MemfdCreate(memfdComment, unix.MFD_CLOEXEC|unix.MFD_ALLOW_SEALING)
}
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to create memfd: %w", err)
}
memfdFile := os.NewFile(uintptr(memfd), memfdComment)
// we *do not* defer memfdFile.Close() here: the memfd must stay open for as
// long as the bind-mount is in use, so the caller is responsible for closing it.
// Copy the contents...
for {
_, err = io.Copy(memfdFile, binFile)
if !errors.Is(err, unix.EINTR) {
break
}
}
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to copy binary text to memfd: %w", err)
}
// Seal the memfd. We ignore errors for the newer seals, but for the
// RUNC_MEMFD_MIN_SEALS set we will return an error. This matches the
// behaviour of libcontainer/nsenter/cloned_binary.c.
const RUNC_MEMFD_MIN_SEALS = unix.F_SEAL_SEAL | unix.F_SEAL_SHRINK | unix.F_SEAL_GROW | unix.F_SEAL_WRITE
const F_SEAL_EXEC = 0x20 // from <linux/fcntl.h>
_, _ = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, unix.F_SEAL_FUTURE_WRITE)
_, _ = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, F_SEAL_EXEC)
_, err = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, RUNC_MEMFD_MIN_SEALS)
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to add required seals to memfd: %w", err)
}
fmt.Fprintf(os.Stdout, "memfd-bind: created sealed memfd with contents of %q\n", path)
return memfdFile, nil
}
func mount(path string) error {
memfdFile, err := memfdClone(path)
if err != nil {
return err
}
defer memfdFile.Close()
memfdPath := fmt.Sprintf("/proc/self/fd/%d", memfdFile.Fd())
// We have to open an O_NOFOLLOW|O_PATH to the memfd magic-link because we
// cannot bind-mount the memfd itself (it's in the internal kernel mount
// namespace and cross-mount-namespace bind-mounts are not allowed). This
// also requires that this program stay alive continuously for the
// magic-link to stay alive...
memfdLink, err := os.OpenFile(memfdPath, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("mount: failed to open /proc/self/fd magic-link for memfd: %w", err)
}
defer memfdLink.Close()
memfdLinkFdPath := fmt.Sprintf("/proc/self/fd/%d", memfdLink.Fd())
exeFile, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("mount: failed to open target runc binary path: %w", err)
}
defer exeFile.Close()
exeFdPath := fmt.Sprintf("/proc/self/fd/%d", exeFile.Fd())
err = unix.Mount(memfdLinkFdPath, exeFdPath, "", unix.MS_BIND, "")
if err != nil {
return fmt.Errorf("mount: failed to mount memfd on top of runc binary path target: %w", err)
}
// If there is a signal we want to do cleanup.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, unix.SIGTERM, unix.SIGINT)
go func() {
<-sigCh
fmt.Fprintf(os.Stdout, "memfd-bind: exit signal caught! cleaning up the bind-mount on %q...\n", path)
_ = cleanup(path)
os.Exit(0)
}()
// Clean up things we don't need...
_ = exeFile.Close()
_ = memfdLink.Close()
// We now have to stay alive to keep the magic-link alive...
fmt.Fprintf(os.Stdout, "memfd-bind: bind-mount of memfd over %q created -- looping forever!\n", path)
for {
// loop forever...
time.Sleep(time.Duration(1<<63 - 1))
// make sure the memfd isn't gc'd
runtime.KeepAlive(memfdFile)
}
}
func main() {
app := cli.NewApp()
app.Name = "memfd-bind"
app.Usage = usage
// Set version to be the same as runC.
var v []string
if version != "" {
v = append(v, version)
}
if gitCommit != "" {
v = append(v, "commit: "+gitCommit)
}
app.Version = strings.Join(v, "\n")
// Set the flags.
app.Flags = []cli.Flag{
cli.BoolFlag{
Name: "cleanup",
Usage: "Do not create a new memfd-sealed file, only clean up an existing one at <path>.",
},
}
app.Action = func(ctx *cli.Context) error {
args := ctx.Args()
if len(args) != 1 {
return errors.New("need to specify a single path to the runc binary")
}
path := ctx.Args()[0]
err := cleanup(path)
// We only care about cleanup errors when doing --cleanup.
if ctx.Bool("cleanup") {
return err
}
return mount(path)
}
if err := app.Run(os.Args); err != nil {
fmt.Fprintf(os.Stderr, "memfd-bind: %v\n", err)
os.Exit(1)
}
}
@lifubang
I have tried it, it works very well.
My question is: is this contrib program intended only for personal use? I don't know whether other upstream projects, such as containerd, docker, and k8s, will change their install scripts to use it or not. If yes, there will be lots of work to do for these projects.
I think this issue may impact some k8s/dockerswarm clusters when a node is in an insufficient-memory state. So I suggest reverting it before we have a new solution. But if we confirm that it will not impact upstream projects, I think we can consider adding a contrib program for anyone who needs it.
The problem is that we cannot just revert it:
The memory overhead of memfd_create(2) can be eliminated for all runc processes running by having a daemon on the system. In an absolute worst case, we can look at getting Docker/containerd/cri-o to do the memfd bind-mount in their daemons in order to make the setup be automatic.
The bindfd approach causes serious performance problems (mount_lock contention) on systems with loads of runc programs running.
We cannot simply drop the binary cloning entirely, because it protects against the /proc/self/exe CVE, at least not until I get the magic-link hardening work I'm working on merged into the kernel.
From where I'm standing, it seems to me that #3931 is the least bad option. Every option has some kind of cost, but at least the memfd one can be mitigated by users -- the mount-based ones affect you and you cannot disable them.
I think this issue may impact some k8s/dockerswarm clusters when a node is in an insufficient-memory state.
If a node cannot have 10MB in memory per container during setup, I don't think the node has enough memory to actually run the container process. Also I don't think Docker, containerd, cri-o or Kubernetes keep processes in the runc create state for long periods of time. The "created" state of a container in (for instance) Docker is completely different to runc create -- there is no live process created with docker create. Most engines using runc only let runc create live for a very short time AFAIK. I might be wrong though.
IMHO, either current main or reverting to the state in runc 1.1 is what seems reasonable.
Isn't using cgroup v2 and a >= 5.14 kernel a valid workaround for these issues in runc 1.1? (Asking due to this comment.) If that is the case, maybe it would be nice to have some knob to switch the behavior back to what is done in 1.1; if the extra memory is a problem, that could be a safe path to have.
Also, just curious: why isn't hiding /proc/self/exe on newer kernels the way to go for that CVE? That prctl is merged upstream IIRC.
For runc: To mitigate the CVE-2019-5736 vulnerability, runc has to be cloned from the host into the container each time. The initial implementation used memfd, copying runc to a temporary in-memory file, but that copy was charged to the container's cgroup memory. A subsequent improvement used bindfd, temporarily bind-mounting runc read-only within the container. Both runc exec and runc start would trigger runc init.
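For illustration, here is a minimal, hedged sketch of the memfd mitigation described above (runc's real implementation lives in libcontainer/nsenter/cloned_binary.c and is more careful; the _CLONED_BINARY marker variable below is made up for this sketch): copy /proc/self/exe into a sealed memfd and re-exec through it, so a compromised container can never reach the host binary. The memfd pages are what end up charged to the creating process's cgroup.
package main

import (
	"fmt"
	"io"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical guard so the re-exec'd copy does not clone itself again.
	if os.Getenv("_CLONED_BINARY") != "" {
		fmt.Println("now running from the sealed in-memory copy")
		return
	}
	os.Setenv("_CLONED_BINARY", "1")
	self, err := os.Open("/proc/self/exe")
	if err != nil {
		panic(err)
	}
	defer self.Close()
	memfd, err := unix.MemfdCreate("cloned_binary", unix.MFD_CLOEXEC|unix.MFD_ALLOW_SEALING)
	if err != nil {
		panic(err)
	}
	memfdFile := os.NewFile(uintptr(memfd), "cloned_binary")
	if _, err := io.Copy(memfdFile, self); err != nil {
		panic(err)
	}
	// Seal the copy so that not even a container breakout can modify it.
	if _, err := unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS,
		unix.F_SEAL_SEAL|unix.F_SEAL_SHRINK|unix.F_SEAL_GROW|unix.F_SEAL_WRITE); err != nil {
		panic(err)
	}
	// Re-exec through the in-memory copy; these pages are the per-invocation
	// memory overhead discussed above.
	if err := unix.Exec(fmt.Sprintf("/proc/self/fd/%d", memfd), os.Args, os.Environ()); err != nil {
		panic(err)
	}
}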
Use case: In a single-node scenario, numerous Pods performed frequent probe actions using exec mode. This resulted in many mount and unmount operations, causing frequent changes to /proc/self/mountinfo.
Efficiency problem in how systemd handles mount changes: systemd consistently polls /proc/self/mountinfo using epoll, then compares the newly read file content with previous content to refresh mount points and statuses. Any mounting event can trigger this O(n) operation. Consequently, systemd's CPU usage remains above 50%.
systemd's CPU usage being consistently above 50% means limited available resources.
Intensively creating pods leads to numerous mount and unmount operations, causing a surge in systemd's CPU usage to 100%.
During pod creation, runc interacts with systemd through dbus to configure cgroup. Given the high CPU load on systemd, this results in pod creation timeouts. The subsequent unloading operations on the pod, involving more unmount events, further exacerbate systemd's load.
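As a rough illustration of the polling mechanism described above (a sketch only, not systemd's code): the kernel reports mount-table changes as poll(2)/epoll(7) events on /proc/self/mountinfo, and the watcher then has to re-read and diff the whole file, which is where the O(n) cost per mount event comes from.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	fds := []unix.PollFd{{Fd: int32(f.Fd()), Events: unix.POLLPRI}}
	for {
		// Blocks until the kernel reports a change in the mount table.
		if _, err := unix.Poll(fds, -1); err != nil {
			if err == unix.EINTR {
				continue
			}
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// A real watcher (like systemd) now re-reads and diffs the whole file.
		fmt.Println("mount table changed")
		fds[0].Revents = 0
	}
}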
For cgroup v1:
The cgroup must first be frozen.
Then, the systemd interface is called to configure the cgroup.
Finally, the cgroup is unfrozen.
If the second step gets blocked and the corresponding process is terminated, the entire cgroup remains frozen. This eventually leads to the runc init process entering a 'D' (uninterruptible sleep) state, making the kill operation unsuccessful and preventing automatic recovery.
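A minimal sketch of that freeze/configure/thaw sequence on cgroup v1 (the cgroup path below is hypothetical, and this is not runc's actual code); it shows how being killed between the freeze and the thaw leaves the cgroup frozen:
package main

import (
	"fmt"
	"os"
	"time"
)

// Hypothetical freezer cgroup path, purely for this sketch.
const freezerState = "/sys/fs/cgroup/freezer/mycontainer/freezer.state"

func setFreezer(state string) error {
	if err := os.WriteFile(freezerState, []byte(state), 0); err != nil {
		return err
	}
	// Freezing is asynchronous: poll until the kernel reports the target state.
	for {
		data, err := os.ReadFile(freezerState)
		if err != nil {
			return err
		}
		if string(data) == state+"\n" {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	if err := setFreezer("FROZEN"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Step 2: ask systemd over D-Bus to update the cgroup configuration.
	// If that call hangs and this process is killed here, nothing ever writes
	// THAWED back -- which is exactly the stuck-frozen failure mode above.
	if err := setFreezer("THAWED"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}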
Addressing the 'D' state problem where the cgroup remains frozen:
Switch to cgroup v2.
Upgrade to kernel version 5.14 or higher.
Resolving the high CPU usage by systemd leading to pod creation failures:
Monitoring and recovery.
Relevant issues: #3904, #3925, #3599, #3931. @rata
@rata PR_SET_HIDE_SELF_EXE was not merged.
If a node cannot have 10MB in memory per container during setup, I don't think the node has enough memory to actually run the container process.
Yes, this is true when there are only a few containers on a node. But there may be many pods on a node; for example, if there are 1000 pods on a node, we would need 13000MB (12GB) of memory left on the node if 1000 sidecar pods (which only do a curl http://podId:8080) start at the same time. @cyphar
either current main or reverting to the state in runc 1.1 is what seems reasonable.
So you mean we may not face this problem in the real world? @rata
@lifubang
So you mean we may not face this problem in the real world? @rata
No, not at all. I have literally no idea if this will be a problem in the real world. I just think @cyphar really wants to try this, and I think what we have in release 1.1 is not ideal, but people live with it. So either way seems reasonable to me. But I'm not deep into the details of this issue, so I trust what you decide (the revert you suggest, trying this, etc.).
Please go for what you think it is best :)
Yes, this is true when there are only a few containers on a node. But there may be many pods on a node; for example, if there are 1000 pods on a node, we would need 13000MB (12GB) of memory left on the node if 1000 sidecar pods (which only do a curl http://podId:8080) start at the same time. @cyphar
When executing pod exec, it ultimately triggers runc exec and initiates runc init. In version 1.1, frequently executing runc exec places significant strain on systemd, leading to runc init hanging. Therefore, I believe rolling back to version 1.1 won't address this issue. @lifubang
Or, since that one is closed, we can discuss it here.
Once the contrib program @cyphar talks about is available, we can add it as a systemd service and push the distro vendors to properly package it, so the users won't have to.
I was also thinking about a separate runc-init binary, which should be way smaller. With #3385 merged, it becomes easier. I gave it a naive and quick try, but the resulting binary is only 20% (2MB) smaller than runc. I am not sure if it's worth pursuing further -- most probably not.
One issue we have is that because the reflect package is used, dead code elimination is disabled, so the Go linker can't reduce the binary size.
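A toy example (not runc code) of why reflect blocks this: once reflect.Value.MethodByName is called with a non-constant name, the linker must assume any exported method could be reached, so it keeps them all and cannot prune unused code.
package main

import (
	"fmt"
	"reflect"
)

type T struct{}

// Never called directly, but it cannot be pruned once MethodByName is used.
func (T) Unused() { fmt.Println("kept in the binary anyway") }

func main() {
	name := "Unused" // non-constant lookup defeats the linker's analysis
	reflect.ValueOf(T{}).MethodByName(name).Call(nil)
}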
I am not sure if it's worth pursuing further -- most probably not.
I think it's not worth doing.
Maybe we can borrow the DMZ concept from the network management area and write a small C program named runc-dmz, whose function is to run the container's first process after nsexec.
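For illustration only (the proposal is for a tiny C program, but a Go sketch shows the idea just as well): a dmz-style helper does nothing except exec the container's first process, so the full runc binary never has to be mapped inside the container.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dmz <entrypoint> [args...]")
		os.Exit(1)
	}
	// Replace ourselves with the container's first process; nothing of the
	// runtime remains in this process image afterwards.
	if err := unix.Exec(os.Args[1], os.Args[1:], os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, "exec:", err)
		os.Exit(1)
	}
}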
@kolyshkin
Once the contrib program @cyphar talks about is available, we can add it as a systemd service and push the distro vendors to properly package it, so the users won't have to.
I'm not convinced this is a good idea. The systemd unit will only be needed "temporarily", until we have a better fix in the kernel, and then it will be dropped, right?
IMHO, hacks should always be contained. When the hack starts to affect a lot of other projects, the hack is out of control and fixing it becomes more complex. If there were merit to the systemd unit besides this, I'd be ok. But if there isn't, I'm not a supporter.
@lifubang
So you mean we may not face this problem in the real world? @rata
Now I know it is a real problem in the real world. Kubernetes e2e tests fail with OOM with how things are in master now. They didn't fail in f73b05dee6d017865d6d8929b0ec65d29ce4f3d2 (i.e. when idmap mounts were just merged).
See for example here to see how it fails with current main branch:
Aug 25 07:54:55.882: INFO: At 2023-08-25 07:54:53 +0000 UTC - event for container-must-be-restarted: {kubelet tmp-node-e2e-da6c2bb1-cos-stable-105-17412-156-30} Failed: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
So, a revert probably fixes those issues. BUT I can confirm the cloned binary rework PR also fixes that issue!
BUT I can confirm the cloned binary rework PR also fixes that issue!
Yes, thanks, I have also confirmed it. #3987 is under review now; once #3987 is merged, this issue can be closed.
@rata
Now I know it is a real problem in the real world. Kubernetes e2e tests fail with OOM with how things are in master now. They didn't fail in https://github.com/opencontainers/runc/commit/f73b05dee6d017865d6d8929b0ec65d29ce4f3d2 (i.e. when idmap mounts were just merged).
What cgroup configuration is used on the e2e tests? Memory migration is deprecated in cgroupv1 and in cgroupv2 it was never supported, so I don't see how the memfd_create() usage could be causing an OOM.
We added the bindfd stuff in response to OOMs in Kubernetes e2e tests several years ago, but I never looked into what was actually going on -- as far as I can tell, the memory usage of memfd_create should never have affected the tests because the memory usage affects only the host memory. Do the e2e tests run the kubelet inside a restricted cgroup, so the "host" memory usage is also limited? If so, IMHO we shouldn't support running on a "host" with less than 10MB of memory.
I get that runc-dmz fixes the issue, but I just don't understand why the memory limit is being hit in the first place...
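One way to check this empirically (a sketch, assuming a pure cgroup v2 host with the unified hierarchy mounted at /sys/fs/cgroup): create a memfd, write some data into it, and see whether memory.current of the writer's own cgroup goes up.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// currentUsage reads memory.current of the cgroup this process belongs to.
func currentUsage() (string, error) {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		return "", err
	}
	// On a pure cgroup v2 host this file has a single "0::/some/group" line.
	group := strings.TrimSpace(strings.TrimPrefix(string(data), "0::"))
	usage, err := os.ReadFile("/sys/fs/cgroup" + group + "/memory.current")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(usage)), nil
}

func main() {
	before, err := currentUsage()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	memfd, err := unix.MemfdCreate("charge-test", unix.MFD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	f := os.NewFile(uintptr(memfd), "charge-test")
	defer f.Close()
	// Write 10MB into the memfd; the pages should show up in the writer's cgroup.
	if _, err := f.Write(make([]byte, 10<<20)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	after, _ := currentUsage()
	fmt.Printf("memory.current before=%s after=%s\n", before, after)
}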
@cyphar I was wondering the same, but I don't really know the node config and Kubernetes e2e tests are kind of painful in several ways, so I'm not eager to look into that :see_no_evil:. I didn't expect this to be counted towards the limit either, but SOME limit is being hit, clearly. If you want to dig into that, please go nuts :)
The runc error is coming from this: https://github.com/opencontainers/runc/blob/26a98ea20ee1e946f07fc8a9ba9f11b84b39e4a0/libcontainer/process_linux.go#L380
I don't know what can cause that we hit that.
Description
With runc built from the main branch: if I create 450 containers on an 8G memory node, all the memory is eaten by runc and the node ends up in a stuck state.
With runc 1.1.8, it succeeds.
I think when using memfd_create, the runc binary will be kept in memory, so if we batch-create containers, it will eat the node's memory before the containers have started. In my head, maybe containerd uses runc create for the first step when running a container.
Steps to reproduce the issue
the bash script:
Describe the results you received and expected
received: The node is in a stuck state, because there is not enough memory left on the node.
expected: All containers should be created normally.
What version of runc are you using?
runc version 1.1.0+dev commit: v1.1.0-688-g74c125d8 spec: 1.1.0 go: go1.18.4 libseccomp: 2.5.1
Host OS information
PRETTY_NAME="Ubuntu 22.04.2 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.2 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy
Host kernel information
Linux iZ2ze4a8qvjqt6lt7ommf8Z 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux