kata-containers / tests

Kata Containers tests, CI, and metrics
https://katacontainers.io/

CC: newly pulled pause image by snapshotter stored in an unexpected location #5781

Open BbolroC opened 1 year ago

BbolroC commented 1 year ago

Description of problem

With the configuration IMAGE_OFFLOAD_TO_GUEST=yes and FORKED_CONTAINERD=no, pod creation under IBM Z SE sometimes gets stuck in a CreateContainerError state with the following error:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

This is a known issue with upstream containerd v1.6.8 (https://github.com/kata-containers/tests/pull/5775#issuecomment-1750968483). A quick remedy would be to remove the pause image and let the snapshotter pull it again. However, the newly pulled image is stored in an unexpected location (it is expected under /run/kata-containers/shared/sandboxes/${sandbox_id}/shared), as follows:

# ls -lah /run/kata-containers/shared/sandboxes/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/shared
total 16K
drwxr-x--- 3 root root 160 Oct 12 11:04 .
drwx------ 5 root root 100 Oct 12 11:04 ..
-rw-r--r-- 1 root root 103 Oct 12 11:04 a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1-e9967091f9448d8a-resolv.conf
-rw-r--r-- 1 root root  11 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-44e4e6f3b60b2926-hostname
-rw-r--r-- 1 root root 103 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-4c6bb0d5b7fc98ff-resolv.conf
-rw-rw-rw- 1 root root   0 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-83476f850307d009-termination-log
-rw-r--r-- 1 root root 205 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-844b44105b991bcd-hosts
drwxrwxrwt 3 root root 140 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-ab6d937a4d086125-serviceaccount
# ls -lah /run/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/
total 28K
drwx------  3 root root  200 Oct 12 11:04 .
drwx--x--x 20 root root  400 Oct 12 11:04 ..
-rw-r--r--  1 root root   89 Oct 12 11:04 address
-rw-r--r--  1 root root 8.4K Oct 12 11:04 config.json
prwx------  1 root root    0 Oct 12 11:07 log
-rw-r--r--  1 root root  101 Oct 12 11:04 monitor_address
drwx--x--x  2 root root   40 Oct 12 11:04 rootfs
-rw-------  1 root root   32 Oct 12 11:04 shim-binary-path
-rw-r--r--  1 root root    7 Oct 12 11:04 shim.pid
lrwxrwxrwx  1 root root  121 Oct 12 11:04 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1
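
For reference, a minimal sketch of the remedy mentioned above (removing the cached pause image so the snapshotter has to pull it fresh), assuming crictl/ctr are available on the node; the image reference here is illustrative, check the actual one with crictl images:

# remove the cached pause image (reference is illustrative)
crictl rmi registry.k8s.io/pause:3.9
# or, equivalently, via containerd's ctr in the k8s.io namespace
ctr -n k8s.io images rm registry.k8s.io/pause:3.9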

This leads to a test failure for Test can pull an unencrypted image inside the guest. https://github.com/kata-containers/tests/blob/61806eee754829166478f1c675b3d2e23dc0b4a7/integration/kubernetes/confidential/agent_image.bats#L71
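
The failing assertion counts the rootfs directories shared with the guest. A hypothetical sketch of the shape of that check (the actual code is at the line linked above; the find invocation here is illustrative):

# count rootfs mounts under the sandbox's shared directory; with the image
# pulled inside the guest, exactly one rootfs is expected on the host side
rootfs=( $(find /run/kata-containers/shared/sandboxes/${sandbox_id}/shared -name rootfs -type d) )
[ ${#rootfs[@]} -eq 1 ]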

This could be resolved by bumping containerd to v1.7, but that is not an option at the moment.

The error appears to happen only on http://jenkins.katacontainers.io/job/kata-containers-CCv0-ubuntu-20.04-s390x-SE-daily/. We could skip the test until the update is finished.
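
A minimal sketch of how such a skip could look in bats, assuming we gate on the architecture (the condition and message are illustrative, not the actual change):

setup() {
    # illustrative guard: skip on s390x until containerd is bumped to v1.7
    [ "$(uname -m)" = "s390x" ] && skip "known containerd v1.6.x issue, see kata-containers/tests#5781"
}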

fitzthum commented 1 year ago

Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

BbolroC commented 1 year ago

> Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

If this issue also occurs on other platforms, it would affect users running clusters (containerd 1.6.x) created without the snapshotter. What do you think? @stevenhorsman @fidencio

stevenhorsman commented 1 year ago

So I think there are potentially two separate things going on, which may or may not be related:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

issues, which we've seen a few times on different platforms, and

[ ${#rootfs[@]} -eq 1 ] 

which we've only seen on the s390x system. So either the two are unrelated, or most of the key already exists errors happened on the AMD nodes, which don't run the same tests, so we wouldn't have seen the second failure there. Either way, I think we should potentially separate these issues?

BbolroC commented 1 year ago

Yeah, I was thinking that while writing the comment. I would say the latter doesn't seem to be what @fitzthum wanted to bring to the table. We have to discuss whether the object with key "xxx" already exists issue will affect users in the next release.

ChengyuZhu6 commented 1 year ago

I found that test 4 failed due to a stale kata process on the TDX CI machine while running the operator tests:

# ps -ef | grep kata
root      717683  716131  0 17:05 ?        00:00:00 sudo -E ./run-local.sh -r kata-qemu-tdx
root      717684  717683  0 17:05 ?        00:00:00 /bin/bash ./run-local.sh -r kata-qemu-tdx
root      721166  672128  0 17:07 pts/29   00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox kata
root     3051702       1  0 Nov01 ?        00:01:50 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /opt/confidential-containers/bin/containerd -id 70c83b7d3bf5ebb5bef7208bf816e2bccfb49962964d4559b50ab80d0112cf26

After I killed the stale kata process, all the tests (including test 4) passed. http://10.112.240.228:8080/job/confidential-containers-operator-main-centos8stream-x86_64-containerd_kata-qemu-tdx-PR/639/console
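
For anyone hitting the same thing, a minimal sketch of the cleanup, using the PID from the listing above (the grep pattern is illustrative):

# list leftover kata shims; anything surviving from a previous run is suspect
ps -ef | grep containerd-shim-kata-v2 | grep -v grep
# kill the stale shim, e.g. the one started on Nov01 in the listing above
sudo kill -9 3051702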

ChengyuZhu6 commented 1 year ago

@BbolroC This could potentially be the reason for the failure of test 4 on the SE machine as well.

BbolroC commented 1 year ago

Thanks @ChengyuZhu6. I will check today whether that is also the cause on the SE machine, after the kata AC meeting (I have another appointment before it).

BbolroC commented 1 year ago

@ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test, Test can pull an unencrypted image inside the guest, passed on the SE machine (with the latest commit in the CCv0 branch) when I reverted the acceptance criteria back to [ ${#rootfs[@]} -eq 1 ].

stevenhorsman commented 1 year ago

> @ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test, Test can pull an unencrypted image inside the guest, passed on the SE machine (with the latest commit in the CCv0 branch) when I reverted the acceptance criteria back to [ ${#rootfs[@]} -eq 1 ].

Thanks, this means that when we move this into main we can go back to -eq 1 rather than -le 1. Thanks a lot to Chengyu for discovering the root cause of this mystery!
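
In other words, the change to restore on main is just the comparison operator (a sketch; the actual line lives in agent_image.bats):

# current workaround on the CCv0 branch
[ ${#rootfs[@]} -le 1 ]
# acceptance criterion to restore once this lands on main
[ ${#rootfs[@]} -eq 1 ]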