Error 'container not running' on 'writing container filesystem' step

tkrishtop commented 1 week ago

Bug Description

One of Arkady's Telco partners has particularly large images, 2-4GB. They statistically frequently have an error 'container not running'. It seems to be related to the fact that preflight tries to run a container before having it fully-pulled.

Version and Command Invocation

1.10.0

Steps to Reproduce:

failed: [jumphost] (item={'started': 1, 'finished': 0, 'ansible_job_id': '761330612078.1855963', 'results_file': '/var/lib/dci-openshift-app-agent/.ansible_async/761330612078.1855963', 'changed': True, 'failed': False, 'item': [{'container_image': 'quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0'}, {'changed': True, 'path': '/tmp/preflight_container_artifacts_3u3uvfok', 'uid': 1002, 'gid': 981, 'owner': 'dci-openshift-app-agent', 'group': 'dci-openshift-app-agent', 'mode': '0700', 'state': 'directory', 'secontext': 'unconfined_u:object_r:user_tmp_t:s0', 'size': 6, 'invocation': {'module_args': {'state': 'directory', 'prefix': 'preflight_container_artifacts_', 'suffix': '', 'path': None}}, 'failed': False, 'item': {'container_image': 'quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0'}, 'ansible_loop_var': 'item'}, ' '], 'ansible_loop_var': 'item'}) => {"ansible_job_id": "761330612078.1855963", "ansible_loop_var": "async_result_item", "async_result_item": {"ansible_job_id": "761330612078.1855963", "ansible_loop_var": "item", "changed": true, "failed": false, "finished": 0, "item": [{"container_image": "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"}, {"ansible_loop_var": "item", "changed": true, "failed": false, "gid": 981, "group": "dci-openshift-app-agent", "invocation": {"module_args": {"path": null, "prefix": "preflight_container_artifacts_", "state": "directory", "suffix": ""}}, "item": {"container_image": "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"}, "mode": "0700", "owner": "dci-openshift-app-agent", "path": "/tmp/preflight_container_artifacts_3u3uvfok", "secontext": "unconfined_u:object_r:user_tmp_t:s0", "size": 6, "state": "directory", "uid": 1002}, " "], "results_file": "/var/lib/dci-openshift-app-agent/.ansible_async/761330612078.1855963", "started": 1}, "attempts": 2, "changed": true, "cmd": ["podman", "run", "--rm", "--privileged", "-e", "PFLT_JUNIT=true", "-e", "PFLT_LOGLEVEL=trace", "-e", "PFLT_ARTIFACTS=/artifacts", "-e", "PFLT_LOGFILE=/artifacts/preflight.log", "-e", "PFLT_DOCKERCONFIG=/opt/config.json", "-v", "/tmp/preflight_container_artifacts_3u3uvfok:/artifacts", "-v", "/tmp/preflight_tmp_dir.mzhit2o4/config.json:/opt/config.json", "quay.io/opdev/preflight:1.10.0", "check", "container", "quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0"], "delta": "0:00:05.162823", "end": "2024-09-27 07:54:11.786187", "finished": 1, "msg": "non-zero return code", "rc": 143, "start": "2024-09-27 07:54:06.623364", "stderr": "time=\"2024-09-27T12:54:06Z\" level=debug msg=\"config file not found, proceeding without it\"
time=\"2024-09-27T12:54:06Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"
time=\"2024-09-27T12:54:06Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0 for platform amd64\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0\"
time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-72102398
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-72102398/fs
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-72102398/fs
time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\"", "stderr_lines": ["time=\"2024-09-27T12:54:06Z\" level=debug msg=\"config file not found, proceeding without it\"", "time=\"2024-09-27T12:54:06Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"", "time=\"2024-09-27T12:54:06Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0 for platform amd64\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimpaps:CMM24.11.0.0\"", "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-72102398", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-72102398/fs", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-72102398/fs", "time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\""], "stdout": "", "stdout_lines": []}
failed: [jumphost] (item={'started': 1, 'finished': 0, 'ansible_job_id': '867988553009.1855982', 'results_file': '/var/lib/dci-openshift-app-agent/.ansible_async/867988553009.1855982', 'changed': True, 'failed': False, 'item': [{'container_image': 'quay.io/nepravee/cmm-aimipps:CMM24.11.0.0'}, {'changed': True, 'path': '/tmp/preflight_container_artifacts_rlrtl_ob', 'uid': 1002, 'gid': 981, 'owner': 'dci-openshift-app-agent', 'group': 'dci-openshift-app-agent', 'mode': '0700', 'state': 'directory', 'secontext': 'unconfined_u:object_r:user_tmp_t:s0', 'size': 6, 'invocation': {'module_args': {'state': 'directory', 'prefix': 'preflight_container_artifacts_', 'suffix': '', 'path': None}}, 'failed': False, 'item': {'container_image': 'quay.io/nepravee/cmm-aimipps:CMM24.11.0.0'}, 'ansible_loop_var': 'item'}, ' '], 'ansible_loop_var': 'item'}) => {"ansible_job_id": "867988553009.1855982", "ansible_loop_var": "async_result_item", "async_result_item": {"ansible_job_id": "867988553009.1855982", "ansible_loop_var": "item", "changed": true, "failed": false, "finished": 0, "item": [{"container_image": "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"}, {"ansible_loop_var": "item", "changed": true, "failed": false, "gid": 981, "group": "dci-openshift-app-agent", "invocation": {"module_args": {"path": null, "prefix": "preflight_container_artifacts_", "state": "directory", "suffix": ""}}, "item": {"container_image": "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"}, "mode": "0700", "owner": "dci-openshift-app-agent", "path": "/tmp/preflight_container_artifacts_rlrtl_ob", "secontext": "unconfined_u:object_r:user_tmp_t:s0", "size": 6, "state": "directory", "uid": 1002}, " "], "results_file": "/var/lib/dci-openshift-app-agent/.ansible_async/867988553009.1855982", "started": 1}, "attempts": 1, "changed": true, "cmd": ["podman", "run", "--rm", "--privileged", "-e", "PFLT_JUNIT=true", "-e", "PFLT_LOGLEVEL=trace", "-e", "PFLT_ARTIFACTS=/artifacts", "-e", "PFLT_LOGFILE=/artifacts/preflight.log", "-e", "PFLT_DOCKERCONFIG=/opt/config.json", "-v", "/tmp/preflight_container_artifacts_rlrtl_ob:/artifacts", "-v", "/tmp/preflight_tmp_dir.mzhit2o4/config.json:/opt/config.json", "quay.io/opdev/preflight:1.10.0", "check", "container", "quay.io/nepravee/cmm-aimipps:CMM24.11.0.0"], "delta": "0:00:04.910753", "end": "2024-09-27 07:54:11.786600", "finished": 1, "msg": "non-zero return code", "rc": 143, "start": "2024-09-27 07:54:06.875847", "stderr": "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"config file not found, proceeding without it\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimipps:CMM24.11.0.0 for platform amd64\"
time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimipps:CMM24.11.0.0\"
time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"
time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-1331259970
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-1331259970/fs
time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-1331259970/fs
time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\"", "stderr_lines": ["time=\"2024-09-27T12:54:07Z\" level=debug msg=\"config file not found, proceeding without it\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"certification library version\" version=\"1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"running checks for quay.io/nepravee/cmm-aimipps:CMM24.11.0.0 for platform amd64\"", "time=\"2024-09-27T12:54:07Z\" level=info msg=\"target image\" image=\"quay.io/nepravee/cmm-aimipps:CMM24.11.0.0\"", "time=\"2024-09-27T12:54:07Z\" level=debug msg=\"pulling image from target registry\"", "time=\"2024-09-27T12:54:07Z\" level=trace msg=\"entering preflight keychain Resolve\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"created temporary directory\" path=/tmp/preflight-1331259970", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"exporting and flattening image\"", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"extracting container filesystem\" path=/tmp/preflight-1331259970/fs", "time=\"2024-09-27T12:54:08Z\" level=debug msg=\"writing container filesystem\" outputDirectory=/tmp/preflight-1331259970/fs", "time=\"2024-09-27T07:54:11-05:00\" level=error msg=\"container not running\""], "stdout": "", "stdout_lines": []}

Expected Result

preflight running normally 100% of time instead of random occurrences of 'container not running'

Actual Result

random occurrences of 'container not running'

Additional Context

The code source producing an error seems to be here. Should we explore the crane options to ensure that the image was completely pulled and include them into preflight code?

acornett21 commented 1 week ago

@tkrishtop Is this for preflight check container? preflight doesn't run the container as part of check container, does this happen outside of DCI against this image? To me this looks like DCI is killing this pod due to resource constraints. We see the same thing in Konflux for large images.

tkrishtop commented 1 week ago

Hi @acornett21 thank you for checking.

Is this for preflight check container

yes, here is preflight.log

time="2024-10-02T10:30:06Z" level=debug msg="config file not found, proceeding without it"
time="2024-10-02T10:30:06Z" level=info msg="certification library version" version="1.10.0 <commit: c9048da99aae76ddee5a708edcc94e14c034cd1d>"
time="2024-10-02T10:30:07Z" level=info msg="running checks for quay.io/XXX/YYY:tag for platform amd64"
time="2024-10-02T10:30:07Z" level=info msg="target image" image="quay.io/XXX/YYY:tag"
time="2024-10-02T10:30:07Z" level=debug msg="pulling image from target registry"
time="2024-10-02T10:30:07Z" level=debug msg="created temporary directory" path=/tmp/preflight-985576278
time="2024-10-02T10:30:07Z" level=debug msg="exporting and flattening image"
time="2024-10-02T10:30:07Z" level=debug msg="extracting container filesystem" path=/tmp/preflight-985576278/fs
time="2024-10-02T10:30:07Z" level=debug msg="writing container filesystem" outputDirectory=/tmp/preflight-985576278/fs

To me this looks like DCI is killing this pod due to resource constraints.

The error code 143 suggests that the podman process was terminated by a SIGTERM (signal 15).
There could be one of two situations here:

Resource Constraints
Preflight timeout killing podman pull

When we test having images pre-pulled locally, the error disappears. That makes me think that we really deal with preflight killing podman pull here and not with the resource constraints.

We see the same thing in Konflux for large images.

Do you use any workarounds to fix the issue?

acornett21 commented 1 week ago

For clarity, preflight doesn't use podman, so there is not container in container type situation here. We use crane as a library.

Since this works with cached images, I'm going to make the logical assumption that this is the same problem as konflux with resources on the host. Can you check the CPU/Memory profile on the host to see if DCI is killing this?

For Konflux, users can increase the CPU/Memory of their pipeline or task. I'd assume DCI has a similar feature.

acornett21 commented 1 week ago

I also just tested the below image a few times which is almost 8GB and preflight has no issue (assuming you have enough /tmp storage space)

docker.io/sagemathinc/cocalc-docker:latest

I tested this directly on a host with the preflight binary. So no container, CI, etc concerns. This makes me think it's more of a CI issue, since this is double the size of the largest image in question.

bcrochet commented 2 hours ago

I concur with @acornett21 . This is not a preflight issue. If you look closely at the log, you can see that the end of the first line is the start of the preflight log. It's as a message of whatever is calling preflight. So, the "container not running" message is coming from that, not preflight.

Closing as "Not a bug in preflight"

redhat-openshift-ecosystem / openshift-preflight