redhat-cop / infra.osbuild

Ansible Collection for management of ostree composer
GNU General Public License v3.0

`wait_compose` module doesn't exit when compose finishes #281

Open sallyom opened 1 year ago

sallyom commented 1 year ago

Builder roles fail by timing out while waiting for the compose to finish, even though the compose finished several minutes earlier. The builder roles are running in an EC2 RHEL 9.2 instance.

JSON output from the VM shows the compose as finished:

    {
        "method": "GET",
        "path": "/compose/finished",
        "status": 200,
        "body": {
            "finished": [
                {
                    "blueprint": "rhde",
                    "compose_type": "edge-container",
                    "id": "01d2e66b-96bc-4477-8978-4d27e16e417f",
                    "image_size": 0,
                    "job_created": 1692152909.3148224,
                    "job_finished": 1692153570.499627,
                    "job_started": 1692152909.3239973,
                    "queue_status": "FINISHED",
                    "version": "0.0.1"
                }
            ]
        }
    },
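For reference, a minimal sketch of how a response shaped like the one above could be checked for a given compose ID. The function name `compose_is_finished` is hypothetical and is not taken from the collection; the sketch only assumes the `body` / `finished` / `id` / `queue_status` layout shown in the output.

```python
def compose_is_finished(response: dict, compose_id: str) -> bool:
    """Return True if compose_id is listed as FINISHED in the response.

    Hypothetical helper: assumes the response layout shown in the issue,
    i.e. response["body"]["finished"] is a list of compose entries.
    """
    finished = response.get("body", {}).get("finished", [])
    return any(
        entry.get("id") == compose_id
        and entry.get("queue_status") == "FINISHED"
        for entry in finished
    )


# Trimmed copy of the output reported above.
sample = {
    "method": "GET",
    "path": "/compose/finished",
    "status": 200,
    "body": {
        "finished": [
            {
                "blueprint": "rhde",
                "compose_type": "edge-container",
                "id": "01d2e66b-96bc-4477-8978-4d27e16e417f",
                "queue_status": "FINISHED",
                "version": "0.0.1",
            }
        ]
    },
}

print(compose_is_finished(sample, "01d2e66b-96bc-4477-8978-4d27e16e417f"))
```

By this check the compose in the output counts as finished, so a wait loop that sees this response would be expected to exit.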

The run never progresses past the wait_compose.py / Wait for compose to finish task.

TASK [infra.osbuild.builder : Wait for compose to finish] **********************
task path: /runner/requirements_collections/ansible_collections/infra/osbuild/roles/builder/tasks/main.yml:121
--- no useful info ---

matoval commented 1 year ago

Hey @sallyom I spun up an ec2 instance and wasn't able to reproduce this issue. I successfully built an edge-container and edge-commit with no issues.

Are you still experiencing this issue?

sallyom commented 1 year ago

@matoval the issue happens when I'm running the multi-stage edge-installer compose_type.

I'm running AAP in OpenShift, with a RHEL 9.2 builder VM in EC2 configured as the remote host. The first stage, edge-commit, completes successfully in the VM, so I know the playbook, inventory, and connection are fine. Several weldr API calls also succeed (the blueprint push, the compose start, etc.). But the playbook running from AAP never proceeds past this first edge-commit stage: the result saying that the edge-commit compose has finished never gets through, so the wait_compose task fails by timing out (it hangs; there is no other error).

Here's the weird thing. I can watch the weldr socket API calls in the RHEL 9 VM, and I see that wait_compose checks every 20s (the default recheck frequency). The instant the compose finishes, wait_compose goes silent: it no longer checks in every 20s. So something has registered that the compose finished, but then there is silence, followed by the eventual timeout.
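To make the expected behavior concrete, here is a rough sketch of a poll-until-finished loop with a recheck interval and a timeout. This is not the module's actual code. The name `wait_for_compose` and its parameters are assumptions for illustration, with the 20s default taken from the recheck frequency mentioned above.

```python
import time
from typing import Callable


def wait_for_compose(
    is_finished: Callable[[], bool],
    timeout: float,
    interval: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_finished() until it returns True or the timeout expires.

    Hypothetical sketch: returns True as soon as a check succeeds,
    False once the deadline has passed without a successful check.
    """
    deadline = clock() + timeout
    while True:
        if is_finished():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))


# Usage with a stand-in status check that succeeds on the third poll.
polls = []

def fake_status() -> bool:
    polls.append(1)
    return len(polls) >= 3

result = wait_for_compose(fake_status, timeout=60, sleep=lambda _s: None)
print(result, len(polls))
```

In a loop of this shape the status check keeps firing until it either succeeds or the deadline passes, so checks stopping at the moment the compose finishes, without the task returning, would suggest the hang is somewhere after the successful check rather than in the polling itself.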

Here's the weirder thing. I can run the exact same playbook with the exact same vars to completion if I instead ssh into the RHEL 9.2 EC2 instance and configure a localhost inventory. When I run it directly on the host, the multi-stage composes complete: first the edge-commit compose finishes and the commit is served as expected, then an empty blueprint is created, then the edge-installer compose completes and I have the ISO image.