Closed amurzeau closed 5 months ago
I found that the issue is that the cgroup is in threaded mode, and in that case, reading cgroup.procs returns ENOTSUP.
By patching runc with the following patch, tests work again and runc doesn't fail:
diff --git a/libcontainer/cgroups/utils.go b/libcontainer/cgroups/utils.go
index b32af4ee..70080efd 100644
--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -19,6 +19,7 @@ import (
 const (
 	CgroupProcesses = "cgroup.procs"
+	CgroupThreads = "cgroup.threads"
 	unifiedMountpoint = "/sys/fs/cgroup"
 	hybridMountpoint = "/sys/fs/cgroup/unified"
 )
@@ -137,14 +138,16 @@ func GetAllSubsystems() ([]string, error) {
 }
 func readProcsFile(dir string) ([]int, error) {
-	f, err := OpenFile(dir, CgroupProcesses, os.O_RDONLY)
+	contents, err := ReadFile(dir, CgroupProcesses)
+	if errors.Is(err, unix.ENOTSUP) {
+		contents, err = ReadFile(dir, CgroupThreads)
+	}
 	if err != nil {
 		return nil, err
 	}
-	defer f.Close()
 	var (
-		s = bufio.NewScanner(f)
+		s = bufio.NewScanner(strings.NewReader(contents))
 		out = []int{}
 	)
Here is the type of the cgroups (these commands were run inside buildkit's dev-env container):
# cat /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.type
threaded
# cat /sys/fs/cgroup/buildkit/cgroup.type
threaded
# cat /sys/fs/cgroup/cgroup.type
domain threaded
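For readers who want to try the fallback outside of runc, here is a minimal standalone sketch of the same idea. It is not runc's code: `readFunc`, `readFromDisk`, `readIDsWithFallback` and `errThreaded` are hypothetical names introduced for illustration, and the sketch uses `syscall.ENOTSUP` from the standard library where the patch uses `unix.ENOTSUP`. Note that the IDs read from cgroup.threads are thread IDs, not process IDs, so the function reports which file the result came from.

```go
package main

import (
	"bufio"
	"errors"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// errThreaded is the error seen in this issue when cgroup.procs is read in a
// threaded cgroup ("operation not supported").
var errThreaded error = syscall.ENOTSUP

// readFunc reads a whole file. It is a hypothetical seam so the fallback can
// be exercised without a real threaded cgroup.
type readFunc func(path string) (string, error)

// readFromDisk is the readFunc to pass for a real cgroup directory.
func readFromDisk(path string) (string, error) {
	b, err := os.ReadFile(path)
	return string(b), err
}

// readIDsWithFallback returns the IDs listed in cgroup.procs. If that read
// fails with ENOTSUP, it returns the IDs listed in cgroup.threads instead.
// The bool is true when the result holds thread IDs rather than process IDs.
func readIDsWithFallback(dir string, read readFunc) ([]int, bool, error) {
	threads := false
	contents, err := read(filepath.Join(dir, "cgroup.procs"))
	if errors.Is(err, errThreaded) {
		threads = true
		contents, err = read(filepath.Join(dir, "cgroup.threads"))
	}
	if err != nil {
		return nil, false, err
	}

	ids := []int{}
	s := bufio.NewScanner(strings.NewReader(contents))
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if line == "" {
			continue // tolerate blank lines
		}
		id, err := strconv.Atoi(line)
		if err != nil {
			return nil, false, err
		}
		ids = append(ids, id)
	}
	return ids, threads, s.Err()
}
```

Against a real cgroup this would be called as `readIDsWithFallback("/sys/fs/cgroup/buildkit/<id>", readFromDisk)`, where `<id>` stands for the container's cgroup name. Returning the extra bool is a design choice of this sketch so that a caller cannot mistake thread IDs for process IDs.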
Hi, I have the same issue (with runc 1.1.10). Having this patch applied to the next version would be awesome!
@Bacto we've changed this part of runc a lot in the main branch. Can you try to repro this using runc compiled from the main branch?
Hi @kolyshkin,
I tried with the main branch and got the same issue:
# runc -v
runc version 1.1.0+dev
commit: 0c5a735
spec: 1.1.0+dev
go: go1.21.6
libseccomp: 2.5.5
Here is the type of the cgroups (these commands were run inside buildkit's dev-env container):
# cat /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.type
threaded
# cat /sys/fs/cgroup/buildkit/cgroup.type
threaded
# cat /sys/fs/cgroup/cgroup.type
domain threaded
So the problem here is the threaded cgroup type. In this case, processes actually belong to the parent cgroup that has the "domain threaded" type (i.e. the top cgroup in this case). It would be incorrect to send SIGKILL to specific threads in this group. So, basically, runc kill does the right thing here by returning an error.
This is some kind of a misconfiguration, possibly caused by buildkit.
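To make the point about ownership concrete, here is a minimal sketch of how one could locate the cgroup whose cgroup.procs is actually readable, by walking up from a threaded cgroup until cgroup.type is no longer "threaded". This is only an illustration, not something runc does: `typeReader`, `readTypeFromDisk` and `threadedDomain` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// typeReader returns the contents of cgroup.type for a cgroup directory.
// It is a hypothetical seam so the walk can be tried without real cgroups.
type typeReader func(dir string) (string, error)

// readTypeFromDisk is the typeReader to pass for a real cgroup v2 hierarchy.
func readTypeFromDisk(dir string) (string, error) {
	b, err := os.ReadFile(filepath.Join(dir, "cgroup.type"))
	return string(b), err
}

// threadedDomain walks up from dir and returns the first cgroup whose type is
// not "threaded". For a threaded cgroup this is the ancestor that lists the
// processes; for any other cgroup it is dir itself.
func threadedDomain(dir string, read typeReader) (string, error) {
	for {
		t, err := read(dir)
		if err != nil {
			return "", fmt.Errorf("reading cgroup.type in %s: %w", dir, err)
		}
		if strings.TrimSpace(t) != "threaded" {
			return dir, nil
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			// Reached the filesystem root without leaving threaded cgroups.
			return "", fmt.Errorf("no non-threaded ancestor found")
		}
		dir = parent
	}
}
```

With the cgroup types reported in this issue, starting from /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft the walk would pass through /sys/fs/cgroup/buildkit and stop at /sys/fs/cgroup, which is the "domain threaded" cgroup. Its cgroup.procs covers every process in the threaded subtree, not only those of one container, which is why signalling from there would not be a safe substitute either.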
Created a Debian 12 VM, checked out buildkit and ran its test suite inside a container (make test). Was not able to reproduce.
I think there was something wrong originally when starting a container.
Would still like to get to the bottom of it, so any suggestions on how to reproduce it (ideally a Vagrantfile or something like that) are welcome.
The issue is fixed in the main branch.
I've tried version 1.1.5 again and reproduced it, but I can't reproduce it with the main branch of runc.
I've tried to find the first fixed version and found that I can reproduce the same issue with 1.1.12 but no longer with 1.2.0-rc.1.
So I'm closing this issue.
For reference, I'm using go test -v -run ^TestIntegration/TestDiffSingleLayer.*$ github.com/moby/buildkit/client -count=1 to run the affected tests in buildkit with the tested runc in /usr/bin/runc.
After upgrading runc, docker still reports this error:
# docker info
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.2.0-rc.1-0-g275e6d85
error:
java.io.IOException: Failed to run top 'a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6'. Error: Error response from daemon: runc did not terminate successfully: exit status 1: unable to get all container pids: read /sys/fs/cgroup/docker/a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6/cgroup.procs: operation not supported
# cat /sys/fs/cgroup/docker/a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6/cgroup.type
threaded
I tried 8256a9384fa8c44aa30b3ed948e7c3e34b19b89a, which fixes this problem.
I tried 8256a93, which fix this problem.
I guess you quoted a wrong commit.
@amurzeau could you do git-bisect to find which runc commit fixes it?
The first commit without the issue is f8ad20f500bf75edd86041657ee762bce116f8f5. The previous one, 9583b3d1c297021109081872c52302316ede15b1, still causes the same failure.
The failure occurs with this stack trace:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/opencontainers/runc/libcontainer/cgroups.readProcsFile({0xc0000f2d40?, 0xc0000f30c0?})
/tmp/runc/libcontainer/cgroups/utils.go:166 +0x372
github.com/opencontainers/runc/libcontainer/cgroups.GetAllPids.func1({0xc0000f2d40, 0x31}, {0x399dc0?, 0xc0001695b0?}, {0x0?, 0x0?})
/tmp/runc/libcontainer/cgroups/getallpids.go:19 +0x79
path/filepath.walkDir({0xc0000f2d40, 0x31}, {0x399dc0, 0xc0001695b0}, 0xc000135180)
/usr/local/go/src/path/filepath/path.go:445 +0x5c
path/filepath.WalkDir({0xc0000f2d40, 0x31}, 0xc000135180)
/usr/local/go/src/path/filepath/path.go:535 +0xb0
github.com/opencontainers/runc/libcontainer/cgroups.GetAllPids({0xc0000f2d40?, 0x6?})
/tmp/runc/libcontainer/cgroups/getallpids.go:12 +0x4e
github.com/opencontainers/runc/libcontainer/cgroups/fs2.(*Manager).GetAllPids(0xc0000feb60?)
/tmp/runc/libcontainer/cgroups/fs2/fs2.go:92 +0x25
github.com/opencontainers/runc/libcontainer.signalAllProcesses({0x39d1c0, 0xc0000feb60}, 0x0?)
/tmp/runc/libcontainer/init_linux.go:583 +0xad
github.com/opencontainers/runc/libcontainer.(*Container).Signal(0xc0000bb220, {0x398770?, 0xaaa8a8}, 0x1)
/tmp/runc/libcontainer/container_linux.go:383 +0x265
main.glob..func7(0xc0000c8580)
/tmp/runc/kill.go:52 +0x113
github.com/urfave/cli.HandleAction({0x2467a0?, 0x323b50?}, 0x4?)
/tmp/runc/vendor/github.com/urfave/cli/app.go:524 +0x50
github.com/urfave/cli.Command.Run({{0x2dfe61, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x307052, 0x52}, {0x0, ...}, ...}, ...)
/tmp/runc/vendor/github.com/urfave/cli/command.go:175 +0x67b
github.com/urfave/cli.(*App).Run(0xc0000ea380, {0xc0000b4000, 0xb, 0xb})
/tmp/runc/vendor/github.com/urfave/cli/app.go:277 +0xb87
main.main()
/tmp/runc/main.go:165 +0x1208
The commit that fixes the issue (f8ad20f500bf75edd86041657ee762bce116f8f5) removes the call to c.ignoreCgroupError(signalAllProcesses(c.cgroupManager, sig)), which was part of the stack trace.
I think this can be reproduced with this bundle: runctest_no_pid_namespace.tar.gz
To test: cd runctest && ./test.sh
The bundle's rootfs just contains a busybox binary at usr/bin/sh and usr/bin/sleep, with its linker dependency if needed (/lib/ld-whatever.so).
Running runc kill --all yields this error at commit 9583b3d1c297021109081872c52302316ede15b1:
ERRO[0000] read /sys/fs/cgroup/buildkit/runctest/cgroup.procs: operation not supported
buildkit / containerd use a pid namespace, so after f8ad20f500bf75edd86041657ee762bce116f8f5, signalAllProcesses is no longer called.
But without a pid namespace (as in my runctest test), it is still called:
https://github.com/opencontainers/runc/blob/f8ad20f500bf75edd86041657ee762bce116f8f5/libcontainer/container_linux.go#L386-L388
And thus it still triggers the error (so I'm not sure the commit really fixes the issue):
ERRO[0000] unable to signal init: read /sys/fs/cgroup/buildkit/runctest/cgroup.procs: operation not supported
Note: I'm running this test in a docker rootful container.
Description
Hi,
While testing buildkit within a docker container, the tests use runc. When trying to kill a runc container, runc errors out with an error like this:
read /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.procs: operation not supported
and then returns error code 1. The command line is this one: runc --root /run/containerd/runc/buildkit --log /tmp/bktest_containerd1141985211/state/io.containerd.runtime.v2.task/buildkit/mxv4shz9kwdm0p5u49mw971ft/log.json --log-format json kill --all mxv4shz9kwdm0p5u49mw971ft 9
Steps to reproduce the issue
Describe the results you received and expected
Several tests using containerd fail with this error:
What version of runc are you using?
runc version v1.1.5
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4
Host OS information
Host:
container running dev-env target from Dockerfile from buildkit git repository:
Host kernel information
Linux DOC-PC3 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19) x86_64 GNU/Linux