/test
Merging #3138 (22a334d) into master (86f0550) will increase coverage by 0.02%. The diff coverage is 100.00%.
@@            Coverage Diff             @@
##           master    #3138      +/-   ##
==========================================
+ Coverage   50.26%   50.28%   +0.02%
==========================================
  Files         120      120
  Lines       15998    16006       +8
==========================================
+ Hits         8041     8049       +8
  Misses       6869     6869
  Partials     1088     1088
/test-clh
/test
Looks like a lot of timeouts on CI jobs. Let’s try one more time...
/test
/test
All CLH CI failures are caused by changes in a dependency repo of cloud-hypervisor, which break the CLH static binary build.
To fix the CI failures, I sent a PR to the packaging repo (https://github.com/kata-containers/packaging/pull/1192) so that CLH is not built from source when a released binary is available (the same approach we use in Kata 2.0).
/test-virtiofs
/test-clh
Added a do-not-merge label to gate on the CLH CI failures.
The CLH CI failures are now being fixed, but quite a few other CI jobs are failing. Any comments on those? @egernst @jcvenegas @devimc @GabyCT @chavafg
Edit: on closer look, all the failures appear to be related to this PR's changes. Please see below for further details, and please share your thoughts. Thanks.
Also, quite a few CI jobs (clh-docker, virtiofs, initrd, vsocks, rhel) are failing on the --cpuset-cpus test, which makes sense:
13:54:59 CPU constraints
13:54:59 /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:147
13:54:59 checking container with CPU constraints
13:54:59 /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:167
13:54:59 with cpuset-cpus to 0
13:54:59 /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:195
13:54:59 /sys/fs/cgroup/cpuset/cpuset.cpus should have 0 [It]
13:54:59 /tmp/jenkins/workspace/kata-containers-runtime-ubuntu-1804-PR-cloud-hypeprvisor-docker/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:196
13:54:59
13:54:59 Expected
13:54:59 <string>: 0
13:54:59 to equal
13:54:59 <string>: 0-1
The actual output of /sys/fs/cgroup/cpuset/cpuset.cpus in the guest ("0-1") makes sense with this patch: we never constrain the cpu/cpuset cgroups in the guest, and the guest always has 1 (default) vCPU plus the number of vCPUs requested from the docker CLI. In this test, --cpuset-cpus 0 requests one vCPU, so the guest ends up with two CPUs and cpuset.cpus reads "0-1". To fix the docker test, I think we should adjust the expected value to "0-1". WDYT?
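For illustration, here is a rough sketch of the kind of assertion change I have in mind. The helper names (dockerRun, Image) are assumptions; the real code in integration/docker/cpu_test.go may look different:

```go
// Sketch only: dockerRun and Image are assumed helpers from the tests repo.
It("/sys/fs/cgroup/cpuset/cpuset.cpus should have 0-1", func() {
	stdout, _, exitCode := dockerRun("--rm", "--cpuset-cpus=0", Image,
		"cat", "/sys/fs/cgroup/cpuset/cpuset.cpus")
	Expect(exitCode).To(BeZero())
	// The guest keeps its default vCPU plus the one requested via
	// --cpuset-cpus, and this patch no longer constrains the guest
	// cpuset, so the guest-side value is "0-1" rather than "0".
	Expect(strings.TrimSpace(stdout)).To(Equal("0-1"))
})
```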
@likebreath, I see that the CI failures are related to the CPU constraints tests as well as the sandbox cgroup part of the shimv2 tests. Maybe related to this change?
That's likely. Also, we may need to adjust the shimv2 tests accordingly, as I mentioned above for the docker tests.
I agree -- we should be updating the tests. @likebreath are you PRing those changes?
Sure, I can open a PR against the tests repo to fix the docker tests. As I am not familiar with the shimv2 tests, could you please take a shot at that one? Thanks. @egernst
Also, it seems we aren't exercising this in the 2.0 tests repo, since we were able to merge without the (buggy) tests failing? :-(
I opened a PR to fix the docker cpuset test. Please check the details in the issue (https://github.com/kata-containers/tests/issues/3241) and the PR (https://github.com/kata-containers/tests/pull/3242). If you have any questions or comments, please let me know. Thank you.
Now that PR kata-containers/tests#3242 has landed, I restarted the CI jobs on clh-docker, virtiofs, initrd, vsocks, and rhel.
TL;DR: All CI jobs are now passing on this PR; some instability of the jenkins-ci-cri-containerd CI was observed, but it is not related to this PR's changes. Merging the PR now.
I checked the history of the CI failures on jenkins-ci-cri-containerd and found that the original failure (CI job 2487) was also on the docker cpuset test (as part of the "sandbox-cgroup" test). Failure details are below. This failure is already fixed by PR kata-containers/tests#3242, and was double-checked with the debugging PR https://github.com/kata-containers/tests/pull/3252.
[2] • Failure [4.868 seconds]
[2] CPU constraints
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:147
[2] checking container with CPU constraints
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:167
[2] with cpuset-cpus to 0
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:195
[2] /sys/fs/cgroup/cpuset/cpuset.cpus should have 0 [It]
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:196
[2]
[2] Expected
[2] <string>: 0
[2] to equal
[2] <string>: 0-1
[2]
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:200
[2] ------------------------------
[2] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
[2]
[2] Summarizing 1 Failure:
[2]
[2] [Fail] CPU constraints checking container with CPU constraints with cpuset-cpus to 0 [It] /sys/fs/cgroup/cpuset/cpuset.cpus should have 0
[2] /tmp/jenkins/workspace/kata-containers-runtime-cri-containerd-PR/go/src/github.com/kata-containers/tests/integration/docker/cpu_test.go:200
[2]
[2] Ran 15 of 195 Specs in 101.759 seconds
[2] FAIL! -- 14 Passed | 1 Failed | 0 Pending | 180 Skipped
The recent failures from restarting the same CI on this PR (jobs 2488-2491) are not related to this PR's changes and appear to be instability of jenkins-ci-cri-containerd itself. The same failure on this CI was also observed on another PR, https://github.com/kata-containers/runtime/pull/3135 (CI job 2492). The observed error is:
[reset] Removing info for node "ubuntu1804-azure233682" from the ConfigMap "kubeadm-config" in the "kube-system" Namespace
W0216 17:23:30.772546 3918 removeetcdmember.go:61] [reset] failed to remove etcd member: error syncing endpoints with etc: etcdclient: no available endpoints
.Please manually remove this etcd member using etcdctl
For future reference: this PR is the back-port of kata-containers/kata-containers#1406.
Today we only clear out the cpuset details when doing an update call on existing containers/pods. This works in the case of Kubernetes, but not when the cpuset details are set explicitly at boot time. For example, if you run a single container via docker:
docker run --cpuset-cpus 0-3 -it alpine sh
the cpuset info is passed in with the container spec in the create-container request to the agent. At that point in time, only the default number of CPUs (1) is available in the guest, so you'd be left with the cpuset set to 0. Next, we hotplug the vCPUs, providing CPUs 0-4 in the guest, but the cpuset is never updated, leaving the application tied to CPU 0.
Ouch.
Until the day we support cpusets in the guest, let's make sure that we start off clearing the cpuset fields.
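As a minimal sketch of the idea, using the OCI runtime-spec types (the function name and the call site are illustrative, not necessarily the exact code in this change):

```go
import specs "github.com/opencontainers/runtime-spec/specs-go"

// clearCPUSets drops host-derived cpuset constraints from a container spec
// before it is sent to the agent. At create time the guest only has its
// default vCPU (more are hot-plugged later), so passing the host cpuset
// through would pin the workload to CPU 0. Sketch only.
func clearCPUSets(spec *specs.Spec) {
	if spec == nil || spec.Linux == nil ||
		spec.Linux.Resources == nil || spec.Linux.Resources.CPU == nil {
		return
	}
	spec.Linux.Resources.CPU.Cpus = ""
	spec.Linux.Resources.CPU.Mems = ""
}
```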
Fixes: ####
Signed-off-by: Eric Ernst eric.g.ernst@gmail.com