Open samuelkarp opened 1 month ago
I'm not sure why it works with containerd 1.7, but this OOM issue with Kubernetes e2e tests is something we know about.
This is caused by #4392 (backport of #3931) which removed the code we added in #1984 to fix the same failing Kubernetes e2e test (reported in #1980). This code was removed because the bindfd logic is not as secure as using memfds and was causing significant performance problems on systems with lots of container churn. When I re-investigated the e2e issue (when working on #3931) I figured out that the OOM is happening because Kubernetes/kubelet/runc(?) is being run in a cgroup with a very low memory limit (this is not caused by the container memory limit as incorrectly claimed in #1980 because the extra memory used by memfds is never accounted by the container cgroup).
My suggestion is to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. I'm not quite sure what the purpose of a test where you run the entire Kubernetes/kubelet stack with such a tiny memory limit is -- sure, containers with a small memory limit makes some sense, but why run the management code in such a constrained setup?
Do you have a link to where containerd or runc is being constrained to a low cgroup memory limit? containerd is run as a normal systemd unit with no memory limit and is itself responsible for directly launching runc (so runc should start inside containerd's own cgroup).
I'm not super familiar with Kubernetes e2e tests, but I believe KubeReservedMemory
is setting the limits of the cgroup that containerd runs in (or something equivalent). See runOomKillerTest
:
func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMemory float64) {
ginkgo.Context(testCase.name, func() {
// Update KubeReservedMemory in KubeletConfig.
if kubeReservedMemory > 0 {
tempSetCurrentKubeletConfig(f, func(ctx context.Context, initialConfig *kubeletconfig.KubeletConfiguration) {
if initialConfig.KubeReserved == nil {
initialConfig.KubeReserved = map[string]string{}
}
// There's a race condition observed between system OOM and cgroup OOM if node alocatable
// memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
// K8s components such that node allocatable memory is less than node capacity to
// observe OOM kills at cgroup level instead of system OOM kills.
initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
})
}
/* ... */
}
The issue here is not the container memory limit, because we know that cannot cause this error (the kernel allocates memory against memcgs based on the cgroup you were in when the memory was allocated -- our memfd
s are allocated and filled while in the host cgroups -- and the kernel will never migrate memory accounting on cgroupv2 even if you change cgrouops, and even with cgroupv1 you need to enable an option that is disabled on all systems by default). This was not obvious in #1980 but upon re-examination in #3931 it became clear that this is about the host memory limits, not the container ones.
I believe
KubeReservedMemory
is setting the limits of the cgroup that containerd runs in (or something equivalent).
I don't believe that's correct. My understanding is that KubeReservedMemory
controls the amount of advertised Memory resource to the Kubernetes scheduler, not that it translates to any cgroup setting on the node itself. @yujuhong, is that correct?
Yes, kube-reserved memory was used to calculate the node allocatable resources. Kubelet doesn't actually set the cgroup values for it. (BTW, for compressible resources like cpu, this PR was merged recently for kubelet to enforce the cgroup values if the cgroup path is provided: https://github.com/kubernetes/kubernetes/pull/125982)
The issue here is not the container memory limit
Yes, I saw your explanation before and that makes sense. I have also validated that our CI hosts are running cgroup v2. But I don't think we understand why it is failing; as far as I can tell there are no constraints on the cgroup runc is running in.
It's EOD here and I'm just getting some progress with this. I'll expand more tomorrow, including the repro I found.
It's still not 100% clear to me what is causing this. I can fix the k8s tests locally with this patch to the k8s repo (github.com/kubernetes/kubernetes):
diff --git test/e2e_node/oomkiller_linux_test.go test/e2e_node/oomkiller_linux_test.go
index db84cb0f0cf..ebc99abfa19 100644
--- test/e2e_node/oomkiller_linux_test.go
+++ test/e2e_node/oomkiller_linux_test.go
@@ -108,7 +108,8 @@ func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMem
// memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
// K8s components such that node allocatable memory is less than node capacity to
// observe OOM kills at cgroup level instead of system OOM kills.
- initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+ //initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+ initialConfig.KubeReserved["memory"] = "100Mi"
})
}
I can also make the tests pass changing the memory limits of them from 15Mi to 20Mi. I'm not sure why is that. I guess that the kubelet makes sure that the cgroup it is in has at least the mem limit and that is why that approach also works.
However, it is not 100% clear to me yet what the issue is nor what the right fix is.
I've also isolated the config.json created for the container (k8s uses two containers, the pause container and the real container, I'm just using the real container as the pause container doesn't have any limits nor anything interesting in the config) and then run it with runc. However, the container runs just fine when runnning it with runc.
Due to that, I suspect the issue is as @cyphar was saying, something about the setup that allocates too little mem for the KubeReserved. If I allocate a static "100Mi" for it, I see it pass locally, and if I allocate too litle (like 10Mi), I see it fail for all tests.
Please note, though, that when it fails it shows the same error than in the containerd issue comment:
[FAILED] pod: "initcontinar-oomkill-target-pod", container: "oomkill-target-init-container" has unexpected exitCode: '\u0080'
Expected
<int32>: 128
to equal
<int32>: 137
In [It] at: k8s.io/kubernetes/test/e2e_node/oomkiller_linux_test.go:152 @ 10/09/24 18:42:39.812
Therefore, what is failing in the test is the exit code. It is expecting 137 but gets 128. When the kernel kills runc[init] it seems it gets 128, but when it kills the "dd" process running in the container, it gets the right exit code.
Indeed, it seems the two test that fail locally are because runc[INIT] is being killed by a OOM in a cgroup. It correlates perfectly with what I see in the kernel logs locally:
# First test gets OOM on the runc process
[Fri Oct 11 02:51:37 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d,mems_allowed=0,oom_memcg=/kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e,task_memcg=/kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e/8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d,task=runc:[2:INIT],pid=554398,uid=999
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554398 (runc:[2:INIT]) total-vm:1605432kB, anon-rss:1744kB, file-rss:944kB, shmem-rss:4116kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] Tasks in /kubepods/burstable/poda6a4366a-76d7-46a0-b52e-c96317070a0e/8a6434811b9e6f4d415ea9e1cd3e66898d1257cc8947c67054c3092c1216df9d are going to be killed due to memory.oom.group set
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554398 (runc:[2:INIT]) total-vm:1605432kB, anon-rss:1744kB, file-rss:944kB, shmem-rss:4116kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] runc:[2:INIT] invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
# Second tests gets OOM on the runc process
[Fri Oct 11 02:51:37 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256,mems_allowed=0,oom_memcg=/kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23,task_memcg=/kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23/379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256,task=runc:[2:INIT],pid=554461,uid=999
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554461 (runc:[2:INIT]) total-vm:1605944kB, anon-rss:2348kB, file-rss:912kB, shmem-rss:3432kB, UID:999 pgtables:168kB oom_score_adj:-997
[Fri Oct 11 02:51:37 2024] Tasks in /kubepods/burstable/podffbb6b04-1c85-4039-8804-a8b793230f23/379e12af22e9387021abbb33d87ffd37862b563f7849158ddb64fd2b3b380256 are going to be killed due to memory.oom.group set
[Fri Oct 11 02:51:37 2024] Memory cgroup out of memory: Killed process 554461 (runc:[2:INIT]) total-vm:1605944kB, anon-rss:2348kB, file-rss:912kB, shmem-rss:3432kB, UID:999 pgtables:168kB oom_score_adj:-997
So, two times the runc init has been killed by the cgroup OOM, the two tests that fail. The other OOM I see in my logs is for the "dd" process (that is the one allocating memory) and that is passing (killing dd with oom gets 137 and not 128 exit code, that makes the test pass).
So, it seems the k8s test is setting-up a cgroup with very low mem for runc to run, this is probably due to the KubeReserved thingy, as changing that to a higher number makes the tests pass just fine here.
The current allocation is a percentage of the node capacity, as @cyphar pointed out: https://github.com/kubernetes/kubernetes/blob/95ec69c16c76b5ee71fdbebc7a5dea2c39341eb3/test/e2e_node/oomkiller_linux_test.go#L111:
if kubeReservedMemory > 0 {
tempSetCurrentKubeletConfig(f, func(ctx context.Context, initialConfig *kubeletconfig.KubeletConfiguration) {
if initialConfig.KubeReserved == nil {
initialConfig.KubeReserved = map[string]string{}
}
// There's a race condition observed between system OOM and cgroup OOM if node alocatable
// memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
// K8s components such that node allocatable memory is less than node capacity to
// observe OOM kills at cgroup level instead of system OOM kills.
initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
})
}
My gut feeling is that the VMs running the CI for containerd 1.6 have less mem than the ones for 1.7 and main, and this calculation gets too low with that VM and the new runc binary is slightly bigger and makes it fail when it seals its on-memory copy of the binary (to fix a CVE, we really want that :)).
@samuelkarp some questions for you:
My gut feeling is that this calculation based on percentage is not very good, we should at least add a max between that number and 100Mi or so. But I'd check out later what the right fix might be for this, I'm too tired now :-D
I found the relevant difference between the containerd 1.6 CI and 1.7 + main CI is which cgroup driver is in use: cgroupfs vs. systemd. When moving containerd 1.6 to use the systemd cgroup driver, the tests no longer fail.
@samuelkarp +1 to use systemd cgroup driver consistently!
@dims this isn't quite about consistency. When both Kubelet and containerd are configured to use cgroupfs, the test fails. When both Kubelet and containerd are configured to use the systemd cgroup driver, the test passes. I didn't try to test a mixed mode (I don't think we'd consider that to be supported).
@samuelkarp ah that sounds bad.
I didn't try to test a mixed mode (I don't think we'd consider that to be supported).
Using different driver on container runtime and kubelet will not work, as we have seen errors in kubernetes tests when done like that.
@rata
It's EOD here and I'm just getting some progress with this. I'll expand more tomorrow, including the repro I found.
If you have time to share the information you found, that would very much be appreciated :smile:.
My gut feeling is that the VMs running the CI for containerd 1.6 have less mem than the ones for 1.7 and main [...]
- Can you confirm what is the memory allowed for the VMs running for containerd 1.6, 1.7 and main? I guess it's different size.
All the jobs are running on GCE e2-medium VMs with 1 vCPU (2 fractional vCPUs) and 4 GiB of memory.
- Can you check what is assigned to the kubeReserved memory in containerd 1.6 and 1.7 CI? It would be great to check what those numbers are for the CI.
- Can you try to run the CI for 1.6 with the patch I pasted here for the k8s bits? I guess if we use 100Mi the CI will pass just fine.
I'll see if we can find answers for these; I'm a containerd maintainer not a Kubernetes maintainer so I'm less familiar with how these tests are run (through the Kubernetes test infrastructure, and Kubernetes test source code). But the VM sizes are the same for 1.6, 1.7, and main, so they should all have the same kubeReserved calculation.
I'm not sure if you saw above, but I did find a difference:
We don't currently have cgroup v1 test coverage in the Kubernetes e2e/containerd CI jobs; fixing that is a separate issue. Does runc have different codepaths depending on whether cgroupfs or the systemd cgroup driver is in use? I see a --systemd-cgroup
flag documented in the man page.
to fix a CVE, we really want that :)
Yep, I am familiar and agree that the CVE should remain mitigated. I also am familiar with the impact to systemd for the bindfd mitigation, so I understand why that's undesirable as well.
My suggestion is to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit.
Yes, if this can be changed in kubernetes e2e test, it will be better.
The core reason is that the memory accounting should not include runc init
processes before we start the container's init process. So, we can join the cgroup as later as possible, at least after call ensure_cloned_binary
, then the memory usage will not be included in cgroup memory accounting. (LIKE #4438).
Please consider whether this is worth to draft a new release 1.1.16 or not?
Could you help to test #4439 in containerd 1.6?
I have test it in my machine, it works fine:
root@acmcoder:/opt/ubuntu# cat config.json | jq .process.args
[
"cat",
"/sys/fs/cgroup/memory.peak",
"/sys/fs/cgroup/memory.max",
"/sys/fs/cgroup/cgroup.procs"
]
root@acmcoder:/opt/ubuntu# cat config.json | jq .linux.resources.memory
{
"limit": 10485760
}
So, the first line of output from the container is the peak usage of memory, the second line is the limit of memory.
Using runc 1.1.15, the peak memory usage will big than the limit for sometimes:
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 -v
runc version 1.1.15
commit: v1.1.15-0-gbc20cb44
spec: 1.0.2-dev
go: go1.22.3
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5681152
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2572288
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
8728576
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5181440
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11657216
10485760
1
# peak > max
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2629632
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11231232
10485760
1
# peak > max
Using #4439, there is no situation of peak>max:
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev -v
runc version 1.1.15+dev
commit: v1.1.13-40-gea5f8e04
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2916352
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3239936
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2711552
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2662400
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3051520
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2797568
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3293184
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1
Please consider whether this is worth to draft a new release 1.1.16 or not?
And I think there is no this situation in the main branch, because the binary clone operation has been moved from runc init
to runc create/run
. Could you consider waiting for 1.2.0 release?
I have triggered a test to use runc https://github.com/opencontainers/runc/pull/4439/commits/ea5f8e0445e80dc29acd9b5d57481a0e936af89b.
cgroupfs job: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/containerd_containerd/10797/pull-containerd-node-e2e-1-6/1844595194326945792 systemd job: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/containerd_containerd/10797/pull-containerd-node-e2e-1-6-systemd-cgroup/1844595302615486464
The test with the https://github.com/opencontainers/runc/pull/4439 fix has passed for both cgroupfs and systemd
@samuelkarp @akhilerm @lifubang Great findings, thanks!
I think I found a bug in the PR @lifubang created (commented there). But when trying to fix it, I think I might have found another bug that was there for a while. More details and links below :)
But first things first. Let me share my setup to run the kubernetes tests locally, as it is definitely not trivial to find (or it wasn't for me :D).
Checkout the containerd repo, start containerd running as root. I'm just doing this:
make && sudo make install && sudo bin/containerd --config test-runc-mem-issue.toml
The config doesn't seem to be special. I'm using version 2 and just changed from the default (generated with bin/containerd condig default
a few years ago, probably what containerd 1.6/1.7 dumps today) this:
root = "/var/lib/containerd-rata"
state = "/run/containerd-rata"
[grpc]
address = "/run/containerd-rata/containerd.sock"
I tried chaning the systemd_cgroup to true/false, but it didn't seem to make a difference. My config is here, just in case: https://paste.debian.net/1331999/
I'm using containerd in this commit 906c23218c21373961bae0665a1dec6b89852fc2. Although I don't think this is relevant either.
Then, checkout the kubernetes repo (github.com/kubernetes/kuberentes). I'm using this commit, although it doesn't seem to matter either: d9c46d8ecb1ede9be30545c9803e17682fcc4b50.
Then, do this from the repo root:
export PATH="/home/rodrigo/src/kinvolk/kubernetes/kubernetes/third_party/etcd:${PATH}"
export CONTAINER_RUNTIME_ENDPOINT=/run/containerd-rata/containerd.sock
export IMAGE_SERVICE_ENDPOINT=/run/containerd-rata/containerd.sock
export CONTAINER_RUNTIME=remote
cp test/e2e_node/jenkins/default-kubelet-config.yaml /tmp/e2e-kubelet.yaml
# Edit the file: remove empty lines and comments and set failSwapOn: false. Otherwise the kubelet fails to start
vim /tmp/e2e-kubelet.yaml
# Compile the test binaries
make all
# Finally run the tests
KUBELET_CONFIG_FILE="/tmp/e2e-kubelet.yaml" KUBE_ROOT="$PWD" FOCUS="OOMKiller" hack/make-rules/test-e2e-node.sh
Here it doesn't fail reliably (but it never takes more than 3 tries to hit the fail) and that is what confused me in the previous message. In the CI does fail/pass reliably, but not locally for me, which I didn't expect.
Also, while debugging the issue locally, I created a bash script to wrap runc. I printed the content of /proc/self/cgroup
and there is indeed no cgroup applies to runc, it always printed the file as empty. So, there seems to be no cgroup in which runc is running for these tests. That is not creating the OOM
The runc-wrapper script is just this simple script, just in case:
#!/bin/bash
echo "Running: runc $@" >> /tmp/rata.log
echo "running runc in cgroup: $(cat /proc/selc/cgroup)" >> /tmp/rata.log
exec <path to real runc> "$@"
In my previous message I thought the kubelet was creating a cgroup to run runc and that was failing. It seems the kubelet can do that (for the kubeReserved, etc.) but it is not doing it in the tests that fail. The fact that fails reliably on CI and not locally, plus that it was kind of late here, confused me the other day. Let's see if this time it is better, as it's also getting late now :D.
@lifubang great findings! I think that fix sounds reasonable at first sight, but it might be introducing a bug (explained it here). I've experimented with a way to fix it here: https://github.com/rata/runc/commits/runc-1.1-mem-issue/.
EDIT: The race doesn't seem possible, because nsexec does another read (a lot of them!) before exec'ing into the user process. So this is safe, the comment in the code is about applying the cgroup before sending the bootstrapData, so we sync on that. Sorry for the noise, I leave below what race I thought there was.
While doing it, I realized we are not waiting to join the cgroup for the setns case (we were only doing it for the initProcess case). In other words, on the setns case, the nsexec.c might continue and create the process before we apply it to the cgroup (we are not syncing on that, although it is unliklely to win that race in practice). My branch there changes that too. It seems to me this lack of sync was on purpose, but I can't see why now. Any thoughts?
In fact, I'd need to check if in (*initProcess).start(), the var p.messageSockPair.parent
that is of time *os.File
, is buffered or a write there blocks until the other end reads? Because this comment seems to assume that nsexec.c when writing here it will block until the go side reads. But I'm unsure that is the case (I'll test it later).
If that is the case, though, my branch is fixing it in any case, as we are doing a read on nsexec now (that will block until there is something to read).
My branch is a work in progress, not ready for a PR yet. But enough to play and experiment further on this.
Am I missing something? @cyphar can you take a look at my branch regarding the nsexec races I explained above?
It's not clear to me yet why on CI only the cgroupfs driver seems to be affected, if anyone have any thoughts on this, please let me know :)
Some TODOs so I don't forget:
Unfortunately, #4399 will influence runc's performance, please see https://github.com/opencontainers/runc/pull/4439#discussion_r1796784791
100 containers test result: runc-1.1.15: 3.025s patch 4399: 4.123s
But we should know that patch 4399 has the same start time usage with v1.2.0-rc.3
.
This is because the time of join cgroup is a little less than the time of binary clone,
so let them executed sequentially will raise the start time usage of the container.
So #4399 and v1.2.0-rc.3
has the same start time usage.
In another words, we should also have to find a solution to reduce the runc start time usage for v1.2.0
.
I strongly recommend to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. For example: 20M.
@rata is /proc/selc/cgroup
(selc
) a typo just in your comment or also in your wrapper script? 👀
@tianon great catch! It's a typo, I adjusted to remove some other stuff I have, sorry :D
@lifubang How did you test the container time? Just a bash for loop and the time it takes to do all of them? Can you share, just in case?
Also, 1100ms difference on 100 containers is just 11ms increase time for each creation. And if it's on-par with main, is this really a big issue? Or am I again too tired now? :D
Just a bash for loop and the time it takes to do all of them?
Yes.
Can you share, just in case?
I can share it tomorrow, because it's in my another PC.
My step to test:
1) I put all versions runc we want to test in current dir.
2) Update config.json: .process.args
= [ "true" ]
| .process.terminal
= false
;
3) ./bench.sh
#!/bin/bash
ver=$1
for i in {1..100}; do
./runc-$ver run test
done
4) Run it: time ./bench.sh 1.1.15
I am hitting this issue randomly in my local setup, when using vagrant boxes with systemd while running CRI tests. Tried using the exact same steps as used in the containerd CI.
This seems to also affect other containerd branches, not just 1.6 with cgroupfs. See this from main (main is not using cgroupfs):
IMHO, we should release 1.1.16 as soon as we merge #4439
None situation is ideal, either without this fix or with this fix. I think these test's mem limits are kind of artificial and probably not super relevant in real life, but it is breaking the CI for users, so it is a real problem. On the other hand, with #4439, the memory of the clone will go to the main runc process (so outside the container cgroup) and maybe real setups might have an issue with that (like maybe people using kube-reserved and a lot of churn? or kubernetes clusters with readiness probes that use exec + using kube-reserved?).
For now the latter is a theretical idea that we don't know it will happen or not, so IMHO let's just fix the real problems that we have, as #4439 does, and if this is a problem for users, of course we can revert and ask projects to adjusts the CI tests.
I see runc 1.2.0 was released. A few questions:
@samuelkarp thanks, I wanted to update here.
Apologies for the delay; I was away for a few days.
Description
In https://github.com/containerd/containerd/pull/10795 we're attempting to update containerd 1.6 to use runc 1.1.15. This triggers a Kubernetes e2e CI run, which is consistently failing. The same update to runc 1.1.15 succeeds with containerd 1.7 and
main
(which will be 2.0).Steps to reproduce the issue
We're currently attempting to narrow down differences between the test environments used in containerd's various branches, since it seems unlikely the actual containerd version will be the cause of the difference. So far, containerd 1.6's tests are using the cgroupfs driver for managing cgroups as opposed to systemd, while the 1.7 and main branch tests use systemd to manage cgroups.
All tests are running on COS M117 and appear to be using cgroup v2 as of now.
Describe the results you received and expected
What version of runc are you using?
1.1.15
Host OS information
No response
Host kernel information
No response