Consider nodeset hash to update the completion status

fao89 commented 2 months ago

softwarefactory-project-zuul[bot] commented 2 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9ecb43a0662c4e08ad923064118b6e09

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 29m 17s
- :x: podified-multinode-edpm-deployment-crc RETRY_LIMIT in 20m 52s
- :heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 13m 57s
- :x: openstack-operator-tempest-multinode RETRY_LIMIT in 23m 24s

fao89 commented 2 months ago

/test openstack-operator-build-deploy-kuttl

fao89 commented 2 months ago

recheck

pablintino commented 2 months ago

I'm fine with it; I tested it against HCI. I'll do some extra runs to be sure it fixes the issue, since it seems unpredictable, even though it showed up for me on every run I did.

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fao89, jpodivin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openstack-k8s-operators/openstack-operator/blob/main/OWNERS)~~ [fao89,jpodivin]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment

rabi commented 2 months ago

LGTM, but where is the blocker jira for this? Are we not still in blocker-only mode?

fao89 commented 2 months ago

> LGTM, but where is the blocker jira for this? Are we not still in blocker-only mode?

it was uncovered by OSPCIX-352 (tempest starts to run before the nodeset is ready)

rabi commented 2 months ago

> > LGTM, but where is the blocker jira for this? Are we not still in blocker-only mode?
>
> it was uncovered by OSPCIX-352 (tempest starts to run before the nodeset is ready)

I had looked at that. If one creates a new deployment after updating/patching the nodeset and then checks the nodeset status, they won't have the issue. The issue in the workflow of that job is that we're checking the status before creating a new deployment.

rabi commented 2 months ago

rabi closed this now

Sorry, clicked the wrong button.

fao89 commented 2 months ago

actually, they check the status after the deployment: https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/default.yaml#L73

2024-07-02 14:06:31,591 p=24790 u=zuul n=ansible | TASK [kustomize_deploy : Apply generated content for examples/va/hci/deployment _raw_params=oc apply -f {{ _cr }}] ***
2024-07-02 14:06:31,591 p=24790 u=zuul n=ansible | Tuesday 02 July 2024  14:06:31 -0400 (0:00:00.073)       0:42:23.561 ********** 
2024-07-02 14:06:32,072 p=24790 u=zuul n=ansible | changed: [localhost]
2024-07-02 14:06:32,093 p=24790 u=zuul n=ansible | TASK [kustomize_deploy : Run Wait Conditions for examples/va/hci/deployment _raw_params={{ wait_condition }}] ***
2024-07-02 14:06:32,094 p=24790 u=zuul n=ansible | Tuesday 02 July 2024  14:06:32 -0400 (0:00:00.502)       0:42:24.064 ********** 
2024-07-02 14:06:32,750 p=24790 u=zuul n=ansible | changed: [localhost] => (item=oc -n openstack wait osdpns openstack-edpm --for condition=Ready --timeout=40m)
2024-07-02 14:06:32,768 p=24790 u=zuul n=ansible | TASK [kustomize_deploy : Stop after applying CRs if requested msg=Failing on demand {{ cifmw_deploy_architecture_stopper }}] ***
2024-07-02 14:06:32,768 p=24790 u=zuul n=ansible | Tuesday 02 July 2024  14:06:32 -0400 (0:00:00.674)       0:42:24.738 **********

https://sf.hosted.upshift.rdu2.redhat.com/logs/96/96/f391b023cb2071c3bc2ce71538724400311d018e/check-gitlab-cee/ci-framework-baremetal-static-node-rhel9-va-hci/f1aa706/logs/controller-0/ci-framework-data/logs/ansible-edpm-deploy.log
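Reading the timestamps in the excerpt above, the wait task appears to return in well under a second despite its 40m timeout, which would be consistent with the nodeset already reporting Ready when the check ran. A minimal sketch of that arithmetic follows; the `parse_ts` helper is made up for illustration, and the two lines are shortened copies of the "Run Wait Conditions" start line and its `changed:` result line.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of an ansible log line."""
    # The first 23 characters hold the timestamp, e.g. "2024-07-02 14:06:32,094".
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


# Shortened copies of two lines from the excerpt above.
wait_started = "2024-07-02 14:06:32,094 p=24790 u=zuul n=ansible | Tuesday 02 July 2024 ..."
wait_finished = "2024-07-02 14:06:32,750 p=24790 u=zuul n=ansible | changed: [localhost] => ..."

# Seconds between the wait task starting and its result being logged.
elapsed = (parse_ts(wait_finished) - parse_ts(wait_started)).total_seconds()
print(f"wait condition returned after {elapsed:.3f}s")
```

This only measures the gap between log lines; it does not show why the condition was already satisfied.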

rabi commented 2 months ago

> actually, they check the status after the deployment: https://github.com/openstack-k8s-operators/architecture/blob/main/automation/vars/default.yaml#L73

Then it could be that they are checking too quickly, before the nodeset has reconciled (after the event from the deployment), or maybe the query [1] we're doing does not return the deployment because we're using an empty context (`context.Background()`).

[1] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/dataplane/openstackdataplanenodeset_controller.go#L474
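To make the idea in the PR title concrete, here is a rough sketch of how a readiness check could take the nodeset hash into account, so that a Ready status left over from an earlier deployment is not reported after the nodeset spec changes. This is not the operator's code (which is written in Go); `spec_hash`, `Deployment` and `nodeset_is_ready` are hypothetical names chosen for illustration, and the hashing scheme is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


def spec_hash(spec: dict) -> str:
    """Return a stable hash of a nodeset spec (hypothetical helper)."""
    # Sorting keys makes the hash independent of dict ordering.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Deployment:
    name: str
    ready: bool
    nodeset_hash: str  # hash of the nodeset spec this deployment ran against


def nodeset_is_ready(current_spec: dict, deployments: List[Deployment]) -> bool:
    """Ready only if a finished deployment matches the current spec hash."""
    current = spec_hash(current_spec)
    return any(d.ready and d.nodeset_hash == current for d in deployments)


old_spec = {"services": ["bootstrap", "configure-network"]}
new_spec = {"services": ["bootstrap", "configure-network", "ceph-client"]}

first = Deployment("first", ready=True, nodeset_hash=spec_hash(old_spec))

# After the nodeset is patched, the earlier deployment no longer counts.
print(nodeset_is_ready(old_spec, [first]))
print(nodeset_is_ready(new_spec, [first]))
```

With a check like this, a wait on the nodeset's Ready condition would presumably keep blocking until a deployment that matches the updated spec has completed, regardless of how soon after the deployment is created the check runs.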

fao89 commented 2 months ago

/cherry-pick 18.0.0-proposed

openshift-cherrypick-robot commented 2 months ago

@fao89: new pull request created: #916

In response to [this](https://github.com/openstack-k8s-operators/openstack-operator/pull/913#issuecomment-2214085741):

>/cherry-pick 18.0.0-proposed

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openstack-k8s-operators / openstack-operator

Consider nodeset hash to update the completion status #913