stevenhorsman opened 2 weeks ago
FYI I think there is possibly a bug in image-rs, maybe related to layer ordering? We are seeing some failures like:
failed to create containerd task: failed to create shim task: failed to handle layer: hasher sha256: Operation not supported (os error 95): unknown
e.g. https://github.com/kata-containers/kata-containers/actions/runs/9486782210/job/26146223525?pr=9828 .
It doesn't happen every time, but pulling registry.k8s.io/e2e-test-images/agnhost:2.21 seems to be the most consistent trigger of it.
cc @arronwy @Xynnn007 @fidencio
FYI - I've tried using registry.k8s.io/e2e-test-images/agnhost:2.21 in the image-rs pull tests and can't reproduce the error, so there might be something specific to the confidential image?
FYI @Xynnn007 - we are still hitting the same error
# Warning Failed 35s (x4 over 85s) kubelet Error: failed to create containerd task: failed to create shim task: failed to handle layer: hasher sha256: channel: send failed SendError { .. }: unknown
particularly with the registry.k8s.io/e2e-test-images/agnhost:2.21 image. If we bump that image to 2.41 it seems to be better, but I don't know if that is just hiding the problem?
@arronwy once mentioned that this is relevant to the image size, as the pulled image will be stored inside the tmpfs, which is only a part of guest memory, usually 1/10 of the guest memory size.
However, I pulled the two images locally: 2.21 is 114 MB and 2.41 is 128 MB. The bigger one is 2.41, which is weird.
Any ideas?
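To put rough numbers on this, here is a back-of-envelope sketch. The 1/10 tmpfs heuristic and the 2 GB guest memory figure are taken from this thread; the idea that compressed layers and the unpacked rootfs might coexist on the tmpfs at peak is speculation, not confirmed image-rs behaviour:

```python
# Back-of-envelope check using numbers quoted in this thread.
# ASSUMPTION: tmpfs budget is ~1/10 of guest memory, and at peak the
# compressed layers and the unpacked rootfs coexist on that tmpfs.
guest_mem_mb = 2048                     # default VM memory mentioned below
tmpfs_budget_mb = guest_mem_mb // 10    # ~204 MB under the 1/10 heuristic

for tag, compressed_mb in [("2.21", 114), ("2.41", 128)]:
    # Headroom left for the unpacked rootfs if the compressed layers
    # are still on the tmpfs when unpacking starts.
    headroom = tmpfs_budget_mb - compressed_mb
    print(f"agnhost:{tag}: compressed {compressed_mb} MB fits the "
          f"{tmpfs_budget_mb} MB budget, leaving only {headroom} MB "
          f"for the unpacked rootfs")
```

So under these assumptions either tag is close to the limit, and the compressed size alone would not discriminate between 2.21 and 2.41; the unpacked or per-layer sizes would matter more.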
That is a very interesting thought. IIRC the default memory size of the VMs we use is 2 GB, so I guess it depends on how image-rs pulls things: if there is a point where all the layers are held at the same time as the rootfs bundle is created from them, it could end up using more memory. Or maybe agnhost:2.21 has a bigger max layer size, or a bigger unpacked size, even though 2.41 is bigger overall?
Just guessing here BTW!
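One way to check the max-layer-size guess would be to compare per-layer sizes in the two image manifests. A minimal sketch of the comparison logic; the manifest below is a made-up example, not agnhost's real manifest (the real ones could be fetched with e.g. `skopeo inspect --raw docker://registry.k8s.io/e2e-test-images/agnhost:2.21`):

```python
# HYPOTHETICAL manifest for illustration only; the digests and sizes
# are invented. An OCI image manifest lists each compressed layer's
# size in bytes, so the largest single layer is easy to extract.
manifest = {
    "layers": [
        {"digest": "sha256:aaa...", "size": 5_000_000},
        {"digest": "sha256:bbb...", "size": 60_000_000},
        {"digest": "sha256:ccc...", "size": 20_000_000},
    ]
}

sizes = [layer["size"] for layer in manifest["layers"]]
print(f"layers: {len(sizes)}, "
      f"total compressed: {sum(sizes) / 1e6:.0f} MB, "
      f"largest layer: {max(sizes) / 1e6:.0f} MB")
```

Running that over both tags' real manifests would show whether 2.21 really does have a bigger largest layer despite being smaller overall.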
I guess the other confusion is that I don't recall us ever seeing this error with previous versions of image-rs (e.g. commit ca6b438), so I wonder if any of the commits in image-rs since 12th March (https://github.com/confidential-containers/guest-components/commits/main/image-rs) might have introduced memory/size/layer handling changes?
We could try to do a bisect, but due to limitations in the kata CI it would be very slow and use a lot of compute power, which we are lacking at the moment.
I guess this also demonstrates again that we could do with better logging/error handling in image-rs. If we did run out of tmpfs here, then I don't think there is a security concern with throwing an error that hints at that?