stevenhorsman opened 2 weeks ago
FYI I think there is possibly a bug in image-rs, maybe related to layer ordering? We are seeing some failures like:
failed to create containerd task: failed to create shim task: failed to handle layer: hasher sha256: Operation not supported (os error 95): unknown
e.g. https://github.com/kata-containers/kata-containers/actions/runs/9486782210/job/26146223525?pr=9828 .
It doesn't happen every time, but pulling registry.k8s.io/e2e-test-images/agnhost:2.21 seems to be the most consistent trigger of it.
cc @arronwy @Xynnn007 @fidencio
FYI - I've tried using registry.k8s.io/e2e-test-images/agnhost:2.21 in the image-rs pull tests and can't reproduce the error, so there might be something specific to the confidential image?
FYI @Xynnn007 - we are still hitting the same error
# Warning Failed 35s (x4 over 85s) kubelet Error: failed to create containerd task: failed to create shim task: failed to handle layer: hasher sha256: channel: send failed SendError { .. }: unknown
particularly with the registry.k8s.io/e2e-test-images/agnhost:2.21 image. If we bump that image to 2.41 it seems to be better, but I don't know if that is just hiding the problem?
@arronwy once mentioned that this is relevant to the image size, as the pulled image will be stored inside the tmpfs, which is only a part of guest memory, usually 1/10 of the guest memory size.
However, I pulled the two images locally: 2.21 is 114 MB and 2.41 is 128 MB. The bigger one is 2.41, which is weird.
Any ideas?
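To put rough numbers on this, here is a back-of-envelope sketch. The 1/10 tmpfs heuristic and the 2 GB guest memory figure are taken from this thread; the idea that compressed layers and the unpacked rootfs might coexist on the tmpfs at peak is speculation, not confirmed image-rs behaviour:

```python
# Back-of-envelope check using numbers quoted in this thread.
# ASSUMPTION: tmpfs budget is ~1/10 of guest memory, and at peak the
# compressed layers and the unpacked rootfs coexist on that tmpfs.
guest_mem_mb = 2048                     # default VM memory mentioned below
tmpfs_budget_mb = guest_mem_mb // 10    # ~204 MB under the 1/10 heuristic

for tag, compressed_mb in [("2.21", 114), ("2.41", 128)]:
    # Headroom left for the unpacked rootfs if the compressed layers
    # are still on the tmpfs when unpacking starts.
    headroom = tmpfs_budget_mb - compressed_mb
    print(f"agnhost:{tag}: compressed {compressed_mb} MB fits the "
          f"{tmpfs_budget_mb} MB budget, leaving only {headroom} MB "
          f"for the unpacked rootfs")
```

So under these assumptions either tag is close to the limit, and the compressed size alone would not discriminate between 2.21 and 2.41; the unpacked or per-layer sizes would matter more.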
That is a very interesting thought. IIRC the default memory size of the VMs we use is 2 GB, so I guess it depends on how image-rs pulls things: if there is a point where all the layers are held at the same time as the rootfs bundle is created from them, it could end up using more memory. Or maybe agnhost:2.21 has a bigger max layer size, or a bigger unpacked size, even though 2.41 is bigger overall?
Just guessing here BTW!
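One way to check the max-layer-size guess would be to compare per-layer sizes in the two image manifests. A minimal sketch of the comparison logic; the manifest below is a made-up example, not agnhost's real manifest (the real ones could be fetched with e.g. `skopeo inspect --raw docker://registry.k8s.io/e2e-test-images/agnhost:2.21`):

```python
# HYPOTHETICAL manifest for illustration only; the digests and sizes
# are invented. An OCI image manifest lists each compressed layer's
# size in bytes, so the largest single layer is easy to extract.
manifest = {
    "layers": [
        {"digest": "sha256:aaa...", "size": 5_000_000},
        {"digest": "sha256:bbb...", "size": 60_000_000},
        {"digest": "sha256:ccc...", "size": 20_000_000},
    ]
}

sizes = [layer["size"] for layer in manifest["layers"]]
print(f"layers: {len(sizes)}, "
      f"total compressed: {sum(sizes) / 1e6:.0f} MB, "
      f"largest layer: {max(sizes) / 1e6:.0f} MB")
```

Running that over both tags' real manifests would show whether 2.21 really does have a bigger largest layer despite being smaller overall.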
I guess the other confusion is that I don't recall us ever seeing this error with previous versions of image-rs (e.g. commit ca6b438), so I wonder if any of the commits in image-rs since 12th March (https://github.com/confidential-containers/guest-components/commits/main/image-rs) might have introduced memory/size/layer handling changes?
We could try to do a bisect, but due to limitations in the kata CI it would be very slow and use a lot of compute power, which we are lacking at the moment.
I guess this also demonstrates again that we could do with better logging/error handling in image-rs. If we did run out of tmpfs here, then I don't think there is a security concern with throwing an error that hints at that?