@cyphar PTAL, thanks.
@lifubang does reverting those commits fix the issue?
does reverting those commits fix the issue?
Yes, it works very well after reverting #3931.
I can add the contrib program to bind-mount the memfd, which should eliminate this issue (see the discussion in the PR). I didn't add it because the primary concern last time was container memory usage, and memfd_create(2) no longer causes issues. The only downside is you need to have a small daemon running on the system.
I really do not want to have to keep bindfd...
I encountered a similar issue. When using the main branch, my test case failed. However, when I solely used the 'remove bindfd' commit, my test case passed. I suspect that this PR might be conflicting with other PRs on the main branch, leading to a new performance issue. I'd recommend not using the main branch and instead trying only @cyphar's PR, @lifubang.
Oh my, this is tough. Let's discuss further in https://github.com/opencontainers/runc/pull/3931
However, when I solely used the 'remove bindfd' commit, my test case passed.
Your suggestion can't fix my issue.
I suspect that this PR might be conflicting with other PRs on the main branch
I don't think that's the problem. It's obviously a design problem rather than a code bug; cyphar has confirmed it and already has a solution for this issue.
So maybe your issue is not the same as mine. You can discuss it further in #3931 or open a new issue once you know how to reproduce it clearly.
The node is in a stuck state. Are there any logs?
Maybe you could test with @cyphar's suggestion. I'll show you the memfd code this afternoon.
I will post a PR, but the code in question is https://github.com/opencontainers/runc/pull/3931#discussion_r1259665352. If you use it to mount on top of the runc binary, you should be able to run as many runc processes as you like without extra memory overhead. The downside is that the program needs to keep running as a daemon -- if the process dies, you cannot exec the mounted runc path (and package updates are more annoying -- you need to stop the daemon, upgrade the runc package, then restart the daemon).
The memfd code is:
package main
import (
"errors"
"fmt"
"io"
"os"
"os/signal"
"runtime"
"strings"
"time"
"github.com/urfave/cli"
"golang.org/x/sys/unix"
)
// version will be populated by the Makefile, read from
// VERSION file of the source code.
var version = ""
// gitCommit will be the hash that the binary was built from
// and will be populated by the Makefile
var gitCommit = ""
const (
usage = `Open Container Initiative contrib/cmd/memfd-bind
In order to protect against certain container attacks, every runc invocation
that involves creating or joining a container will cause runc to make a copy of
the runc binary in memory (usually to a memfd). While "runc init" is very
short-lived, this extra memory usage can cause problems for containers with
very small memory limits (or containers that have many "runc exec" invocations
applied to them at the same time).
memfd-bind is a tool to create a persistent memfd-sealed-copy of the runc binary,
which will cause runc to not make its own copy. This means you can get the
benefits of using a sealed memfd as runc's binary (even in a container breakout
attack to get write access to the runc binary, neither the underlying binary
nor the memfd copy can be changed).
To use memfd-bind, just specify the path of the runc binary you want to
protect:
$ sudo memfd-bind /usr/bin/runc
Note that (due to kernel restrictions on bind-mounts), this program must remain
running on the host in order for the binary to be readable (it is recommended
you use a systemd unit to keep this process around).
If this program dies, there will be a leftover mountpoint that always returns
-EINVAL when attempting to access it. You need to use memfd-bind --cleanup on the
path in order to unmount the path (regular umount(8) will not work):
$ sudo memfd-bind --cleanup /usr/bin/runc
Note that (due to restrictions on /proc/$pid/fd/$fd magic-link resolution),
only privileged users (specifically, those that have ptrace privileges over the
memfd-bind daemon) can access the memfd bind-mount. This means that using this
tool to harden your /usr/bin/runc binary would result in unprivileged users
being unable to execute the binary. If this is an issue, you could make all
privileged processes use a different copy of runc (by making a copy somewhere
like /usr/sbin/runc) and only use memfd-bind for the version used by
privileged users.
`
)
func cleanup(path string) error {
file, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("cleanup: failed to open runc binary path: %w", err)
}
defer file.Close()
fdPath := fmt.Sprintf("/proc/self/fd/%d", file.Fd())
// Keep umounting until we hit a umount error.
for unix.Unmount(fdPath, unix.MNT_DETACH) == nil {
// loop...
}
fmt.Fprintf(os.Stdout, "memfd-bind: path %q has been cleared of all old bind-mounts\n", path)
return nil
}
func memfdClone(path string) (*os.File, error) {
binFile, err := os.Open(path)
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to open runc binary path: %w", err)
}
defer binFile.Close()
const memfdComment = "runc_cloned:/proc/self/exe"
memfd, err := unix.MemfdCreate(memfdComment, unix.MFD_CLOEXEC|unix.MFD_EXEC|unix.MFD_ALLOW_SEALING)
if errors.Is(err, unix.EINVAL) {
// old kernel with no MFD_EXEC support
memfd, err = unix.MemfdCreate(memfdComment, unix.MFD_CLOEXEC|unix.MFD_ALLOW_SEALING)
}
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to create memfd: %w", err)
}
memfdFile := os.NewFile(uintptr(memfd), memfdComment)
// we *do not* defer memfdFile.Close() here: the memfd must stay open for as
// long as the bind-mount is in use, so the caller is responsible for closing it.
// Copy the contents...
for {
_, err = io.Copy(memfdFile, binFile)
if !errors.Is(err, unix.EINTR) {
break
}
}
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to copy binary text to memfd: %w", err)
}
// Seal the memfd. We ignore errors for the newer seals, but for the
// RUNC_MEMFD_MIN_SEALS set we will return an error. This matches the
// behaviour of libcontainer/nsenter/cloned_binary.c.
const RUNC_MEMFD_MIN_SEALS = unix.F_SEAL_SEAL | unix.F_SEAL_SHRINK | unix.F_SEAL_GROW | unix.F_SEAL_WRITE
const F_SEAL_EXEC = 0x20 // from <linux/fcntl.h>
_, _ = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, unix.F_SEAL_FUTURE_WRITE)
_, _ = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, F_SEAL_EXEC)
_, err = unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS, RUNC_MEMFD_MIN_SEALS)
if err != nil {
return nil, fmt.Errorf("memfd clone: failed to add required seals to memfd: %w", err)
}
fmt.Fprintf(os.Stdout, "memfd-bind: created sealed memfd with contents of %q\n", path)
return memfdFile, nil
}
func mount(path string) error {
memfdFile, err := memfdClone(path)
if err != nil {
return err
}
defer memfdFile.Close()
memfdPath := fmt.Sprintf("/proc/self/fd/%d", memfdFile.Fd())
// We have to open an O_NOFOLLOW|O_PATH to the memfd magic-link because we
// cannot bind-mount the memfd itself (it's in the internal kernel mount
// namespace and cross-mount-namespace bind-mounts are not allowed). This
// also requires that this program stay alive continuously for the
// magic-link to stay alive...
memfdLink, err := os.OpenFile(memfdPath, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("mount: failed to open /proc/self/fd magic-link for memfd: %w", err)
}
defer memfdLink.Close()
memfdLinkFdPath := fmt.Sprintf("/proc/self/fd/%d", memfdLink.Fd())
exeFile, err := os.OpenFile(path, unix.O_PATH|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
if err != nil {
return fmt.Errorf("mount: failed to open target runc binary path: %w", err)
}
defer exeFile.Close()
exeFdPath := fmt.Sprintf("/proc/self/fd/%d", exeFile.Fd())
err = unix.Mount(memfdLinkFdPath, exeFdPath, "", unix.MS_BIND, "")
if err != nil {
return fmt.Errorf("mount: failed to mount memfd on top of runc binary path target: %w", err)
}
// If there is a signal we want to do cleanup.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, unix.SIGTERM, unix.SIGINT)
go func() {
<-sigCh
fmt.Fprintf(os.Stdout, "memfd-bind: exit signal caught! cleaning up the bind-mount on %q...\n", path)
_ = cleanup(path)
os.Exit(0)
}()
// Clean up things we don't need...
_ = exeFile.Close()
_ = memfdLink.Close()
// We now have to stay alive to keep the magic-link alive...
fmt.Fprintf(os.Stdout, "memfd-bind: bind-mount of memfd over %q created -- looping forever!\n", path)
for {
// loop forever...
time.Sleep(time.Duration(1<<63 - 1))
// make sure the memfd isn't gc'd
runtime.KeepAlive(memfdFile)
}
}
func main() {
app := cli.NewApp()
app.Name = "memfd-bind"
app.Usage = usage
// Set version to be the same as runC.
var v []string
if version != "" {
v = append(v, version)
}
if gitCommit != "" {
v = append(v, "commit: "+gitCommit)
}
app.Version = strings.Join(v, "\n")
// Set the flags.
app.Flags = []cli.Flag{
cli.BoolFlag{
Name: "cleanup",
Usage: "Do not create a new memfd-sealed file, only clean up an existing one at <path>.",
},
}
app.Action = func(ctx *cli.Context) error {
args := ctx.Args()
if len(args) != 1 {
return errors.New("need to specify a single path to the runc binary")
}
path := ctx.Args()[0]
err := cleanup(path)
// We only care about cleanup errors when doing --cleanup.
if ctx.Bool("cleanup") {
return err
}
return mount(path)
}
if err := app.Run(os.Args); err != nil {
fmt.Fprintf(os.Stderr, "memfd-bind: %v\n", err)
os.Exit(1)
}
}
@lifubang
I have tried it, it works very well.
My question is: is this contrib program intended only for personal use? I don't know whether other upstream projects, such as containerd, docker, and k8s, will change their install scripts to use it or not. If yes, there will be lots of work to do for these projects.
I think this issue may impact some k8s/dockerswarm clusters when a node is in an insufficient-memory state. So I suggest reverting it before we have a new solution. But if we confirm that it will not impact upstream projects, I think we can consider adding a contrib program for anyone who needs it.
The problem is that we cannot just revert it:
The memory overhead of memfd_create(2) can be eliminated for all runc processes running by having a daemon on the system. In an absolute worst case, we can look at getting Docker/containerd/cri-o to do the memfd bind-mount in their daemons in order to make the setup be automatic.
The bindfd approach causes serious performance problems (mount_lock contention) on systems with loads of runc programs running.
We cannot simply drop the binary cloning entirely, because it protects against the /proc/self/exe CVE, at least not until I get the magic-link hardening work I'm working on merged into the kernel.
From where I'm standing, it seems to me that #3931 is the least bad option. Every option has some kind of cost, but at least the memfd one can be mitigated by users -- the mount-based ones affect you and you cannot disable them.
I think this issue may impact some k8s/dockerswarm clusters when a node is in an insufficient-memory state.
If a node cannot have 10MB in memory per container during setup, I don't think the node has enough memory to actually run the container process. Also I don't think Docker, containerd, cri-o or Kubernetes keep processes in the runc create state for long periods of time. The "created" state of a container in (for instance) Docker is completely different to runc create -- there is no live process created with docker create. Most engines using runc only let runc create live for a very short time AFAIK. I might be wrong though.
IMHO, either current main or reverting to the state in runc 1.1 is what seems reasonable.
Isn't using cgroup v2 and a >= 5.14 kernel a valid workaround for these issues in runc 1.1? (Asking due to this comment.) If that is the case, maybe it would be nice to have some knob to switch the behavior back to what is done in 1.1; if the extra memory is a problem, that could be a safe path to have.
Also, just curious: why isn't hiding /proc/self/exe on newer kernels the way to go for that CVE? That prctl is merged upstream IIRC.
For runc: To mitigate the CVE-2019-5736 vulnerability, runc has to be cloned from the host into the container each time. The initial implementation used memfd, copying runc to a temporary in-memory file, but that copy was charged to the container's cgroup memory. A subsequent improvement used bindfd, temporarily bind-mounting runc read-only within the container. Both runc exec and runc start would trigger runc init.
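For illustration, here is a minimal, hedged sketch of the memfd mitigation described above (runc's real implementation lives in libcontainer/nsenter/cloned_binary.c and is more careful; the _CLONED_BINARY marker variable below is made up for this sketch): copy /proc/self/exe into a sealed memfd and re-exec through it, so a compromised container can never reach the host binary. The memfd pages are what end up charged to the creating process's cgroup.
package main

import (
	"fmt"
	"io"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical guard so the re-exec'd copy does not clone itself again.
	if os.Getenv("_CLONED_BINARY") != "" {
		fmt.Println("now running from the sealed in-memory copy")
		return
	}
	os.Setenv("_CLONED_BINARY", "1")
	self, err := os.Open("/proc/self/exe")
	if err != nil {
		panic(err)
	}
	defer self.Close()
	memfd, err := unix.MemfdCreate("cloned_binary", unix.MFD_CLOEXEC|unix.MFD_ALLOW_SEALING)
	if err != nil {
		panic(err)
	}
	memfdFile := os.NewFile(uintptr(memfd), "cloned_binary")
	if _, err := io.Copy(memfdFile, self); err != nil {
		panic(err)
	}
	// Seal the copy so that not even a container breakout can modify it.
	if _, err := unix.FcntlInt(uintptr(memfd), unix.F_ADD_SEALS,
		unix.F_SEAL_SEAL|unix.F_SEAL_SHRINK|unix.F_SEAL_GROW|unix.F_SEAL_WRITE); err != nil {
		panic(err)
	}
	// Re-exec through the in-memory copy; these pages are the per-invocation
	// memory overhead discussed above.
	if err := unix.Exec(fmt.Sprintf("/proc/self/fd/%d", memfd), os.Args, os.Environ()); err != nil {
		panic(err)
	}
}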
Use case: In a single-node scenario, numerous Pods performed frequent probe actions using exec mode. This resulted in many mount and unmount operations, causing frequent changes to /proc/self/mountinfo.
Efficiency problem in how systemd handles mount changes: systemd consistently polls /proc/self/mountinfo using epoll, then compares the newly read file content with previous content to refresh mount points and statuses. Any mounting event can trigger this O(n) operation. Consequently, systemd's CPU usage remains above 50%.
systemd's CPU usage being consistently above 50% means limited available resources.
Intensively creating pods leads to numerous mount and unmount operations, causing a surge in systemd's CPU usage to 100%.
During pod creation, runc interacts with systemd through dbus to configure cgroup. Given the high CPU load on systemd, this results in pod creation timeouts. The subsequent unloading operations on the pod, involving more unmount events, further exacerbate systemd's load.
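As a rough illustration of the polling mechanism described above (a sketch only, not systemd's code): the kernel reports mount-table changes as poll(2)/epoll(7) events on /proc/self/mountinfo, and the watcher then has to re-read and diff the whole file, which is where the O(n) cost per mount event comes from.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	fds := []unix.PollFd{{Fd: int32(f.Fd()), Events: unix.POLLPRI}}
	for {
		// Blocks until the kernel reports a change in the mount table.
		if _, err := unix.Poll(fds, -1); err != nil {
			if err == unix.EINTR {
				continue
			}
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// A real watcher (like systemd) now re-reads and diffs the whole file.
		fmt.Println("mount table changed")
		fds[0].Revents = 0
	}
}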
For cgroup v1:
The cgroup must first be frozen.
Then, the systemd interface is called to configure the cgroup.
Finally, the cgroup is unfrozen.
If the second step gets blocked and the corresponding process is terminated, the entire cgroup remains frozen. This eventually leads to the runc init process entering a 'D' (uninterruptible sleep) state, making the kill operation unsuccessful and preventing automatic recovery.
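A minimal sketch of that freeze/configure/thaw sequence on cgroup v1 (the cgroup path below is hypothetical, and this is not runc's actual code); it shows how being killed between the freeze and the thaw leaves the cgroup frozen:
package main

import (
	"fmt"
	"os"
	"time"
)

// Hypothetical freezer cgroup path, purely for this sketch.
const freezerState = "/sys/fs/cgroup/freezer/mycontainer/freezer.state"

func setFreezer(state string) error {
	if err := os.WriteFile(freezerState, []byte(state), 0); err != nil {
		return err
	}
	// Freezing is asynchronous: poll until the kernel reports the target state.
	for {
		data, err := os.ReadFile(freezerState)
		if err != nil {
			return err
		}
		if string(data) == state+"\n" {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	if err := setFreezer("FROZEN"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Step 2: ask systemd over D-Bus to update the cgroup configuration.
	// If that call hangs and this process is killed here, nothing ever writes
	// THAWED back -- which is exactly the stuck-frozen failure mode above.
	if err := setFreezer("THAWED"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}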
Addressing the 'D' state problem where the cgroup remains frozen:
Switch to cgroup v2.
Upgrade to kernel version 5.14 or higher.
Resolving the high CPU usage by systemd leading to pod creation failures:
Monitoring and recovery.
Relevant issues: #3904, #3925, #3599, #3931. @rata
@rata PR_SET_HIDE_SELF_EXE was not merged.
If a node cannot have 10MB in memory per container during setup, I don't think the node has enough memory to actually run the container process.
Yes, this is true when there are only a few containers on a node. But there may be many pods on a node; for example, if there are 1000 pods on a node, we would need 13000MB (12GB) of memory left on the node if 1000 sidecar pods (which only do a curl http://podId:8080) start at the same time. @cyphar
either current main or reverting to the state in runc 1.1 is what seems reasonable.
So you mean we may not face this problem in the real world? @rata
@lifubang
So you mean we may not face this problem in the real world? @rata
No, not at all. I have literally no idea if this will be a problem in the real world. I just think @cyphar really wants to try this, and I think what we have in release 1.1 is not ideal, but people live with it. So either way seems reasonable to me. But I'm not deep into the details of this issue, so I trust what you decide (the revert you suggest, trying this, etc.).
Please go for what you think it is best :)
Yes, this is true when there are only a few containers on a node. But there may be many pods on a node; for example, if there are 1000 pods on a node, we would need 13000MB (12GB) of memory left on the node if 1000 sidecar pods (which only do a curl http://podId:8080) start at the same time. @cyphar
When executing pod exec, it ultimately triggers runc exec and initiates runc init. In version 1.1, frequently executing runc exec places significant strain on systemd, leading to runc init hanging. Therefore, I believe rolling back to version 1.1 won't address this issue. @lifubang
Or, since that one is closed, we can discuss it here.
Once the contrib program @cyphar talks about is available, we can add it as a systemd service and push the distro vendors to properly package it, so the users won't have to.
I was also thinking about a separate runc-init binary, which should be way smaller. With #3385 merged, it becomes easier. I gave it a naive and quick try, but the resulting binary is only 20% (2MB) smaller than runc. I am not sure if it's worth pursuing further -- most probably not.
One issue we have is that because the reflect package is used, dead code elimination is disabled, so the Go linker can't reduce the binary size.
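A toy example (not runc code) of why reflect blocks this: once reflect.Value.MethodByName is called with a non-constant name, the linker must assume any exported method could be reached, so it keeps them all and cannot prune unused code.
package main

import (
	"fmt"
	"reflect"
)

type T struct{}

// Never called directly, but it cannot be pruned once MethodByName is used.
func (T) Unused() { fmt.Println("kept in the binary anyway") }

func main() {
	name := "Unused" // non-constant lookup defeats the linker's analysis
	reflect.ValueOf(T{}).MethodByName(name).Call(nil)
}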
I am not sure if it's worth pursuing further -- most probably not.
I think it's not worth doing.
Maybe we can borrow the DMZ concept from the network management area and write a small C program named runc-dmz, whose function is to run the container's first process after nsexec.
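For illustration only (the proposal is for a tiny C program, but a Go sketch shows the idea just as well): a dmz-style helper does nothing except exec the container's first process, so the full runc binary never has to be mapped inside the container.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dmz <entrypoint> [args...]")
		os.Exit(1)
	}
	// Replace ourselves with the container's first process; nothing of the
	// runtime remains in this process image afterwards.
	if err := unix.Exec(os.Args[1], os.Args[1:], os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, "exec:", err)
		os.Exit(1)
	}
}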
@kolyshkin
Once the contrib program @cyphar talks about is available, we can add it as a systemd service and push the distro vendors to properly package it, so the users won't have to.
I'm not convinced this is a good idea. The systemd unit will only be needed "temporarily", until we have a better fix in the kernel, and then it will be dropped, right?
IMHO, hacks should always be contained. When the hack starts to affect a lot of other projects, the hack is out of control and fixing it becomes more complex. If there were merit to the systemd unit besides this, I'd be ok. But if there isn't, I'm not a supporter.
@lifubang
So you mean we may not face this problem in the real world? @rata
Now I know it is a real problem in the real world. Kubernetes e2e tests fail with OOM with how things are in master now. They didn't fail in f73b05dee6d017865d6d8929b0ec65d29ce4f3d2 (i.e. when idmap mounts were just merged).
See for example here to see how it fails with current main branch:
Aug 25 07:54:55.882: INFO: At 2023-08-25 07:54:53 +0000 UTC - event for container-must-be-restarted: {kubelet tmp-node-e2e-da6c2bb1-cos-stable-105-17412-156-30} Failed: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
So, a revert probably fixes those issues. BUT I can confirm the cloned binary rework PR also fixes that issue!
BUT I can confirm the cloned binary rework PR also fixes that issue!
Yes, thanks, I have also confirmed it. #3987 is under review now; once #3987 is merged, this issue can be closed.
@rata
Now I know it is a real problem in the real world. Kubernetes e2e tests fail with OOM with how things are in master now. They didn't fail in https://github.com/opencontainers/runc/commit/f73b05dee6d017865d6d8929b0ec65d29ce4f3d2 (i.e. when idmap mounts were just merged).
What cgroup configuration is used on the e2e tests? Memory migration is deprecated in cgroupv1 and in cgroupv2 it was never supported, so I don't see how the memfd_create() usage could be causing an OOM.
We added the bindfd stuff in response to OOMs in Kubernetes e2e tests several years ago, but I never looked into what was actually going on -- as far as I can tell, the memory usage of memfd_create should never have affected the tests because the memory usage affects only the host memory. Do the e2e tests run the kubelet inside a restricted cgroup, so the "host" memory usage is also limited? If so, IMHO we shouldn't support running on a "host" with less than 10MB of memory.
I get that runc-dmz fixes the issue, but I just don't understand why the memory limit is being hit in the first place...
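One way to check this empirically (a sketch, assuming a pure cgroup v2 host with the unified hierarchy mounted at /sys/fs/cgroup): create a memfd, write some data into it, and see whether memory.current of the writer's own cgroup goes up.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// currentUsage reads memory.current of the cgroup this process belongs to.
func currentUsage() (string, error) {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		return "", err
	}
	// On a pure cgroup v2 host this file has a single "0::/some/group" line.
	group := strings.TrimSpace(strings.TrimPrefix(string(data), "0::"))
	usage, err := os.ReadFile("/sys/fs/cgroup" + group + "/memory.current")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(usage)), nil
}

func main() {
	before, err := currentUsage()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	memfd, err := unix.MemfdCreate("charge-test", unix.MFD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	f := os.NewFile(uintptr(memfd), "charge-test")
	defer f.Close()
	// Write 10MB into the memfd; the pages should show up in the writer's cgroup.
	if _, err := f.Write(make([]byte, 10<<20)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	after, _ := currentUsage()
	fmt.Printf("memory.current before=%s after=%s\n", before, after)
}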
@cyphar I was wondering the same, but I don't really know the node config and Kubernetes e2e tests are kind of painful in several ways, so I'm not eager to look into that :see_no_evil:. I didn't expect this to be counted towards the limit either, but SOME limit is being hit, clearly. If you want to dig into that, please go nuts :)
The runc error is coming from this: https://github.com/opencontainers/runc/blob/26a98ea20ee1e946f07fc8a9ba9f11b84b39e4a0/libcontainer/process_linux.go#L380
I don't know what can cause that we hit that.
Description
With runc built from the main branch: if I create 450 containers on an 8G memory node, all the memory is eaten by runc and the node ends up in a stuck state.
With runc 1.1.8, it succeeds.
I think when using memfd_create, the runc binary will be kept in memory, so if we batch-create containers, it will eat the node's memory before the containers have started. In my head, maybe containerd uses runc create for the first step when running a container.
Steps to reproduce the issue
the bash script:
Describe the results you received and expected
received: The node is in a stuck state, because there is not enough memory left on the node.
expected: All containers should be created normally.
What version of runc are you using?
runc version 1.1.0+dev commit: v1.1.0-688-g74c125d8 spec: 1.1.0 go: go1.18.4 libseccomp: 2.5.1
Host OS information
PRETTY_NAME="Ubuntu 22.04.2 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.2 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy
Host kernel information
Linux iZ2ze4a8qvjqt6lt7ommf8Z 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux