kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

cpuset: when creating container, don't pass cpuset details #3138

Closed egernst closed 3 years ago

egernst commented 3 years ago

Today we only clear out the cpuset details when doing an update call on an existing container/pod. This works in the case of Kubernetes, but not in the case where the cpuset details are set explicitly at boot time. For example, if you are running a single container via docker:

docker run --cpuset-cpus 0-3 -it alpine sh

What would happen is the cpuset info would be passed in with the container spec in the create-container request to the agent. At that point in time, only the default number of CPUs (1) would be available in the guest, so you'd be left with the cpuset set to 0. Next, we'd hotplug the vCPUs, providing CPUs 0-4 in the guest, but the cpuset would never be updated, leaving the application tied to CPU 0.

Ouch.

Until the day we support cpusets in the guest, let's make sure that we start off clearing the cpuset fields.

Fixes: ####

Signed-off-by: Eric Ernst eric.g.ernst@gmail.com
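
For readers following along, here is a minimal sketch of the approach described above: clear the cpuset fields before the create-container request reaches the agent. The helper name, call site, and wiring are assumptions for illustration only, not the actual patch:

```go
package main

import (
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// clearCPUSetDetails (hypothetical) removes cpuset/memset values from the OCI
// spec; they refer to host CPU numbering, which does not apply in the guest
// until cpusets are properly supported there.
func clearCPUSetDetails(spec *specs.Spec) {
	if spec == nil || spec.Linux == nil || spec.Linux.Resources == nil ||
		spec.Linux.Resources.CPU == nil {
		return
	}
	spec.Linux.Resources.CPU.Cpus = ""
	spec.Linux.Resources.CPU.Mems = ""
}

func main() {
	// e.g. `docker run --cpuset-cpus 0-3 ...` would put "0-3" here.
	spec := &specs.Spec{
		Linux: &specs.Linux{
			Resources: &specs.LinuxResources{
				CPU: &specs.LinuxCPU{Cpus: "0-3"},
			},
		},
	}
	clearCPUSetDetails(spec)
	fmt.Printf("cpus=%q mems=%q\n", spec.Linux.Resources.CPU.Cpus, spec.Linux.Resources.CPU.Mems)
}
```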

egernst commented 3 years ago

/test

codecov[bot] commented 3 years ago

Codecov Report

Merging #3138 (22a334d) into master (86f0550) will increase coverage by 0.02%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3138      +/-   ##
==========================================
+ Coverage   50.26%   50.28%   +0.02%     
==========================================
  Files         120      120              
  Lines       15998    16006       +8     
==========================================
+ Hits         8041     8049       +8     
  Misses       6869     6869              
  Partials     1088     1088              

likebreath commented 3 years ago

/test-clh

likebreath commented 3 years ago

/test

egernst commented 3 years ago

Looks like a lot of timeouts on CI jobs. Let’s try one more time...

/test

jcvenegas commented 3 years ago

/test

likebreath commented 3 years ago

All CLH CI failures are caused by changes in a dependency repo of cloud-hypervisor, which is causing the build of the CLH static binary to fail.

To fix the CI failures, I sent a PR to the packaging repo (https://github.com/kata-containers/packaging/pull/1192) to not build CLH from source when the released binary is available (same as what we are doing in kata 2.0).

jcvenegas commented 3 years ago

/test-virtiofs

likebreath commented 3 years ago

/test-clh

likebreath commented 3 years ago

Added a do-not-merge label to gate on the CLH CI failures.

likebreath commented 3 years ago

The CLH CI failures are now being fixed, but there are quite a few other failing CIs. Any comments on those? @egernst @jcvenegas @devimc @GabyCT @chavafg

Edit: on a closer look, all the failures appear to be related to the PR changes. Please see below for further details, and please share your thoughts. Thanks.

likebreath commented 3 years ago

Also, quite a few CI jobs (clh-docker, virtiofs, initrd, vsocks, rhel) are failing on the --cpuset-cpus test, which makes sense:

13:54:59 CPU constraints
13:54:59 /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:147
13:54:59   checking container with CPU constraints
13:54:59   /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:167
13:54:59     with cpuset-cpus to 0
13:54:59     /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:195
13:54:59       /sys/fs/cgroup/cpuset/cpuset.cpus should have 0 [It]
13:54:59       /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:196
13:54:59 
13:54:59       Expected
13:54:59           <string>: 0
13:54:59       to equal
13:54:59           <string>: 0-1

The actual output of /sys/fs/cgroup/cpuset/cpuset.cpus in the guest ("0-1") makes sense with this patch, as we never constrain the cpu/cpuset cgroups in the guest, and we always have 1 (default vCPU) + the requested number of vCPUs from the docker CLI. To fix the docker test, I think we should adjust the Expected value to 0-1. WDYT?
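
To make the arithmetic behind that expected value concrete, here is a small illustrative snippet (the function is hypothetical, not part of the tests repo): with 1 default vCPU plus the 1 CPU requested by --cpuset-cpus 0, the guest exposes 2 CPUs, and since the guest cgroups are unconstrained, cpuset.cpus covers all of them.

```go
package main

import "fmt"

// expectedCpusetCpus (illustrative only) computes what the guest's unconstrained
// cpuset.cpus file should read: the default vCPU count plus the vCPUs hot-plugged
// for the container request, numbered from 0.
func expectedCpusetCpus(defaultVCPUs, requestedVCPUs int) string {
	total := defaultVCPUs + requestedVCPUs
	if total <= 1 {
		return "0"
	}
	return fmt.Sprintf("0-%d", total-1)
}

func main() {
	// `docker run --cpuset-cpus 0` requests 1 CPU; with 1 default vCPU the guest
	// sees 2 CPUs, so the cgroup file reads "0-1", matching the CI output above.
	fmt.Println(expectedCpusetCpus(1, 1))
}
```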

GabyCT commented 3 years ago

@likebreath, I see that the CI failures are related to the CPU constraints tests as well as the sandbox cgroup of the shimv2 tests. Maybe related to this change?

likebreath commented 3 years ago

> I see that the CI failures are related to the CPU constraints tests as well as the sandbox cgroup of the shimv2 tests. Maybe related to this change?

That's likely. Also, we may need to adjust the shimv2 tests accordingly, as I mentioned above for the docker tests.

egernst commented 3 years ago

I agree -- we should be updating the tests. @likebreath are you PRing those changes?

likebreath commented 3 years ago

> I agree -- we should be updating the tests. @likebreath are you PRing those changes?

Sure. I can open a PR to the tests repo to fix the docker tests. As I am not familiar with the shimv2 test, can you please take a shot at that one? Thanks. @egernst

egernst commented 3 years ago

Also: it seems we aren't exercising this in the 2.0 tests repo, since we were able to merge without failures from the (buggy) tests? :-(

likebreath commented 3 years ago

I opened a PR to fix the docker cpuset test. Please check the details in the issue (https://github.com/kata-containers/tests/issues/3241) and the PR (https://github.com/kata-containers/tests/pull/3242). If you have any questions or comments, please let me know. Thank you.

likebreath commented 3 years ago

Now that PR kata-containers/tests#3242 has landed, I restarted the CI jobs on clh-docker, virtiofs, initrd, vsocks, rhel.

egernst commented 3 years ago

I see one job still failing: http://jenkins.katacontainers.io/blue/organizations/jenkins/kata-containers-runtime-cri-containerd-PR/detail/kata-containers-runtime-cri-containerd-PR/2489/pipeline#log-2409

likebreath commented 3 years ago

TL;DR: All CI jobs are now passing on this PR, though instability of the jenkins-ci-cri-containerd CI is observed (not related to this PR's changes). Merging the PR now.

I checked the history of the CI failures on jenkins-ci-cri-containerd and found that the original failure (CI job 2487) was also on the docker cpuset test (but as part of the "sandbox-cgroup" test). Failure details are below. This failure is already fixed by the PR kata-containers/tests#3242, and was double-checked with the debugging PR https://github.com/kata-containers/tests/pull/3252.

[2] • Failure [4.868 seconds]
[2] CPU constraints
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:147
[2]   checking container with CPU constraints
[2]   /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:167
[2]     with cpuset-cpus to 0
[2]     /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:195
[2]       /sys/fs/cgroup/cpuset/cpuset.cpus should have 0 [It]
[2]       /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:196
[2] 
[2]       Expected
[2]           <string>: 0
[2]       to equal
[2]           <string>: 0-1
[2] 
[2]       /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:200
[2] ------------------------------
[2] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
[2] 
[2] Summarizing 1 Failure:
[2] 
[2] [Fail] CPU constraints checking container with CPU constraints with cpuset-cpus to 0 [It] /sys/fs/cgroup/cpuset/cpuset.cpus should have 0 
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:200
[2] 
[2] Ran 15 of 195 Specs in 101.759 seconds
[2] FAIL! -- 14 Passed | 1 Failed | 0 Pending | 180 Skipped

The recent failures from restarting the same CI on this PR (jobs 2488-2491) are not related to the changes from this PR, and appear to be instability of jenkins-ci-cri-containerd itself. The same failure on this CI is also observed in another PR, https://github.com/kata-containers/runtime/pull/3135 (CI job 2492). The observed error is:

[reset] Removing info for node "ubuntu1804-azure233682" from the ConfigMap "kubeadm-config" in the "kube-system" Namespace
W0216 17:23:30.772546    3918 removeetcdmember.go:61] [reset] failed to remove etcd member: error syncing endpoints with etc: etcdclient: no available endpoints
.Please manually remove this etcd member using etcdctl

likebreath commented 3 years ago

For future reference: This PR serves as the back-porting PR of kata-containers/kata-containers#1406.