@tkrishtop Is this for preflight `check container`? preflight doesn't run the container as part of `check container`, does this happen outside of DCI against this image? To me this looks like DCI is killing this pod due to resource constraints. We see the same thing in Konflux for large images.
Hi @acornett21, thank you for checking.
Is this for preflight `check container`?
Yes, here is preflight.log:
time="2024-10-02T10:30:06Z" level=debug msg="config file not found, proceeding without it"
time="2024-10-02T10:30:06Z" level=info msg="certification library version" version="1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>"
time="2024-10-02T10:30:07Z" level=info msg="running checks for quay.io/XXX/YYY:tag for platform amd64"
time="2024-10-02T10:30:07Z" level=info msg="target image" image="quay.io/XXX/YYY:tag"
time="2024-10-02T10:30:07Z" level=debug msg="pulling image from target registry"
time="2024-10-02T10:30:07Z" level=debug msg="created temporary directory" path=/tmp/preflight-985576278
time="2024-10-02T10:30:07Z" level=debug msg="exporting and flattening image"
time="2024-10-02T10:30:07Z" level=debug msg="extracting container filesystem" path=/tmp/preflight-985576278/fs
time="2024-10-02T10:30:07Z" level=debug msg="writing container filesystem" outputDirectory=/tmp/preflight-985576278/fs
To me this looks like DCI is killing this pod due to resource constraints.
Exit code 143 suggests that the podman process was terminated by a SIGTERM (signal 15). There could be one of two situations here: either DCI kills the pod due to resource constraints, or preflight itself terminates the podman pull.
When we test with the images pre-pulled locally, the error disappears. That makes me think we are really dealing with preflight killing the podman pull here, and not with resource constraints.
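As an illustration of where 143 comes from: POSIX shells report a process killed by a signal as 128 + the signal number, so 143 = 128 + 15 (SIGTERM). Below is a minimal, generic Go sketch of that mapping (not DCI's or preflight's code; `sleep 300` merely stands in for a long-running pull, Unix-only):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Stand-in for a long-running pull; we terminate it ourselves to reproduce the status.
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond)
	_ = cmd.Process.Signal(syscall.SIGTERM) // what a CI/resource manager would send

	err := cmd.Wait()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// On Unix, Sys() is a syscall.WaitStatus that records the killing signal.
		if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
			sig := ws.Signal()
			fmt.Printf("terminated by signal %d -> shell-style exit code %d\n", int(sig), 128+int(sig))
		}
	}
}
```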
We see the same thing in Konflux for large images.
Do you use any workarounds to fix the issue?
For clarity, preflight doesn't use podman, so there is no container-in-container situation here. We use crane as a library.
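For illustration, the pull-and-flatten flow visible in the log above looks roughly like this when crane is used as a library (a minimal sketch with go-containerregistry, not preflight's actual code; the image reference is the placeholder from the log):

```go
package main

import (
	"log"
	"os"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Pull the image manifest and layers from the registry (no container runtime involved).
	img, err := crane.Pull("quay.io/XXX/YYY:tag")
	if err != nil {
		log.Fatalf("pull failed: %v", err)
	}

	out, err := os.Create("/tmp/flattened-fs.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Export writes the flattened container filesystem as a tarball, which is then
	// extracted to a temp dir, as the "exporting and flattening image" log lines show.
	if err := crane.Export(img, out); err != nil {
		log.Fatalf("export failed: %v", err)
	}
}
```

Nothing in this flow starts a container.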
Since this works with cached images, I'm going to make the logical assumption that this is the same problem as in Konflux: resources on the host. Can you check the CPU/Memory profile on the host to see if DCI is killing this?
For Konflux, users can increase the CPU/Memory of their pipeline or task. I'd assume DCI has a similar feature.
I also just tested the below image a few times, which is almost 8GB, and preflight has no issue (assuming you have enough storage space under /tmp):
docker.io/sagemathinc/cocalc-docker:latest
I tested this directly on a host with the preflight binary, so there are no container, CI, etc. concerns. This makes me think it's more of a CI issue, since this is double the size of the largest image in question.
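For reference, that direct invocation would be along these lines (run as the preflight binary on the host; auth and platform flags omitted):

```shell
preflight check container docker.io/sagemathinc/cocalc-docker:latest
```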
I concur with @acornett21. This is not a preflight issue. If you look closely at the log, you can see that the end of the first line is the start of the preflight log; the rest is a message from whatever is calling preflight. So the "container not running" message is coming from that caller, not from preflight.
Closing as "Not a bug in preflight"
Bug Description
One of Arkady's Telco partners has particularly large images, 2-4GB. They frequently hit the error 'container not running'. It seems to be related to the fact that preflight tries to run a container before it has been fully pulled.
Version and Command Invocation
1.10.0
Steps to Reproduce:
Expected Result
preflight running normally 100% of time instead of random occurrences of 'container not running'
Actual Result
random occurrences of 'container not running'
Additional Context
The source code producing the error seems to be here. Should we explore the crane options to ensure that the image has been completely pulled, and include them in the preflight code?
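To make that question concrete, one way to verify a complete pull with go-containerregistry would be to validate the pulled image against its manifest (a hedged sketch only, not existing preflight code; the image reference is a placeholder):

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
	"github.com/google/go-containerregistry/pkg/v1/validate"
)

func main() {
	img, err := crane.Pull("quay.io/XXX/YYY:tag")
	if err != nil {
		log.Fatalf("pull failed: %v", err)
	}

	// validate.Image re-reads the layers and checks their digests and sizes against
	// the manifest, which would catch a partially pulled or truncated image.
	if err := validate.Image(img); err != nil {
		log.Fatalf("image appears incomplete or corrupt: %v", err)
	}
	log.Println("image fully pulled and consistent")
}
```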