Open BbolroC opened 1 year ago
Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.
If this issue also occurs on other platforms, it would affect users running a cluster (containerd 1.6.x) created without the snapshotter. What do you think? @stevenhorsman @fidencio
So I think there are potentially two separate things going on that may or may not be related:
Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown
issues which we've seen a few times on different platforms and
[ ${#rootfs[@]} -eq 1 ]
which we've only seen on the s390x system. So either the two are unrelated, or most of the "key already exists" errors have happened on the AMD nodes, which don't run the same tests, so we wouldn't know. I think we should potentially separate these issues?
Yeah, I was thinking that while writing the comment. I would say the latter doesn't seem to be what @fitzthum wanted to bring to the table. We have to discuss whether the object with key "xxx" already exists
issue will affect users in the next release.
I found that test 4 failed due to a stale kata process on the TDX CI machine while running the operator tests:
/ ps -ef|grep kata
root 717683 716131 0 17:05 ? 00:00:00 sudo -E ./run-local.sh -r kata-qemu-tdx
root 717684 717683 0 17:05 ? 00:00:00 /bin/bash ./run-local.sh -r kata-qemu-tdx
root 721166 672128 0 17:07 pts/29 00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox kata
root 3051702 1 0 Nov01 ? 00:01:50 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /opt/confidential-containers/bin/containerd -id 70c83b7d3bf5ebb5bef7208bf816e2bccfb49962964d4559b50ab80d0112cf26
After I killed the stale kata process, all the tests (including test 4) passed. http://10.112.240.228:8080/job/confidential-containers-operator-main-centos8stream-x86_64-containerd_kata-qemu-tdx-PR/639/console
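For anyone wanting to check a CI machine for leftover shims before a run, here is a minimal sketch. It only lists processes whose command is containerd-shim-kata-v2; deciding which of them are actually stale (for example, started days before the current job) is left to the operator. The function name find_kata_shim_pids is hypothetical, and the function reads ps-style input from stdin so it can be tried without a live system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PIDs of kata shim processes.
# Expects input in the format produced by `ps -eo pid=,args=`
# (PID in the first column, command line after it).
find_kata_shim_pids() {
    # $2 is the executable path; match only the shim binary itself,
    # so the `grep kata` or run-local.sh lines are not reported.
    awk '$2 ~ /containerd-shim-kata-v2$/ { print $1 }'
}

# Example on a live machine (commented out because killing a shim is destructive):
# ps -eo pid=,args= | find_kata_shim_pids
# ps -eo pid=,args= | find_kata_shim_pids | xargs -r sudo kill   # -r is a GNU xargs option
```

Listing first and killing in a separate step keeps a shim that belongs to a running pod from being removed by accident.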
@BbolroC This could potentially be the reason for the failure of test 4 on the SE machine as well.
Thanks @ChengyuZhu6. I will check today whether that is the cause on SE, after the kata AC meeting (I have another appointment before it).
@ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test, "Test can pull an unencrypted image inside the guest", passed on the SE machine (with the latest commit in the CCv0 branch) when I reverted the acceptance criteria back to [ ${#rootfs[@]} -eq 1 ].
Thanks, this means when we move this into main we can go back to the -eq 1 rather than -le 1. Thanks a lot to Chengyu for discovering the root cause of this mystery!
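To make the difference between the two criteria concrete, here is a small sketch. It is not the code from agent_image.bats; the function name assert_single_rootfs and the way the directories are collected are assumptions for illustration. It needs bash 4 or newer for mapfile.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only when exactly one directory named
# "rootfs" exists below the given path.
assert_single_rootfs() {
    local shared_dir="$1"
    local rootfs=()
    # Collect every matching directory into an array, one entry per line.
    mapfile -t rootfs < <(find "$shared_dir" -type d -name rootfs 2>/dev/null)
    # -eq 1 demands exactly one match; -le 1 would also accept zero matches.
    [ "${#rootfs[@]}" -eq 1 ]
}
```

With -le 1 the check also passes when nothing is found at all, which is why the strict -eq 1 form is the stronger assertion once the underlying cause is gone.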
Description of problem
With the configuration IMAGE_OFFLOAD_TO_GUEST=yes and FORKED_CONTAINERD=no, pod creation under IBM Z SE sometimes gets stuck in a CreateContainerError state with the following error:

It is a known issue with upstream containerd v1.6.8 (https://github.com/kata-containers/tests/pull/5775#issuecomment-1750968483). A quick remedy would be to remove the pause image and get the snapshotter to pull the image. But the newly pulled image is then stored in an unexpected location (/run/kata-containers/shared/sandboxes/${sandbox_id}/shared is the expected one), as follows:

This leads to a test failure for "Test can pull an unencrypted image inside the guest": https://github.com/kata-containers/tests/blob/61806eee754829166478f1c675b3d2e23dc0b4a7/integration/kubernetes/confidential/agent_image.bats#L71

This could be resolved by bumping containerd to v1.7, but that is not an option at the moment.

The error appears to happen only at http://jenkins.katacontainers.io/job/kata-containers-CCv0-ubuntu-20.04-s390x-SE-daily/. We could skip the test until the update is finished.
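One way the temporary skip could be expressed is a version guard, sketched below. The helper version_lt and the variable containerd_version are hypothetical names, the comparison relies on GNU sort -V, and the commented guard assumes it runs inside a bats test, where skip is available. How the version string is obtained is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 is strictly lower than version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Hypothetical guard inside a bats test body (commented out; `skip` exists only in bats):
# if version_lt "${containerd_version#v}" "1.7.0"; then
#     skip "known containerd issue below v1.7"
# fi
```

Keying the skip to the containerd version rather than to the platform means the test would start running again on its own once the update lands.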