openstack-k8s-operators / openstack-operator

Meta Operator for OpenStack
https://openstack-k8s-operators.github.io/openstack-operator/
Apache License 2.0

Enforce update order for OVN for Ctlplane/EDPM #792

Closed dprince closed 3 months ago

dprince commented 4 months ago

Enforce update order for OVN for Ctlplane/EDPM

Jira: OSPRH-6732

dprince commented 4 months ago

/hold

softwarefactory-project-zuul[bot] commented 4 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/8e0ea450916246cd80dd8146d3b20e6d

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 26m 44s
- :x: podified-multinode-edpm-deployment-crc FAILURE in 1h 08m 29s
- :x: cifmw-crc-podified-edpm-baremetal FAILURE in 1h 08m 36s
- :x: openstack-operator-tempest-multinode FAILURE in 1h 06m 35s

softwarefactory-project-zuul[bot] commented 4 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/1683d7919bce48a88c1732103ab81f82

- :x: openstack-k8s-operators-content-provider FAILURE in 11m 16s
- :warning: podified-multinode-edpm-deployment-crc SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)
- :warning: cifmw-crc-podified-edpm-baremetal SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)
- :warning: openstack-operator-tempest-multinode SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)

softwarefactory-project-zuul[bot] commented 4 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/21aabfd5a3c940a49daa0d26f131d5c3

- :x: openstack-k8s-operators-content-provider FAILURE in 10m 52s
- :warning: podified-multinode-edpm-deployment-crc SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)
- :warning: cifmw-crc-podified-edpm-baremetal SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)
- :warning: openstack-operator-tempest-multinode SKIPPED (skipped due to failed job openstack-k8s-operators-content-provider)

dprince commented 4 months ago

recheck

softwarefactory-project-zuul[bot] commented 4 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/b8c58d7c412e4c1388e0bb7edcdaf75e

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 50m 37s
- :heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 17m 44s
- :heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 15m 37s
- :x: openstack-operator-tempest-multinode RETRY_LIMIT in 24m 06s

dprince commented 4 months ago

recheck

softwarefactory-project-zuul[bot] commented 3 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/662941fc411642439586560bc4a94706

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 2h 08m 08s
- :heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 23m 34s
- :heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 24m 21s
- :x: openstack-operator-tempest-multinode FAILURE in 1h 52m 11s

dprince commented 3 months ago

recheck

softwarefactory-project-zuul[bot] commented 3 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/a0359f9137894b62b3af503929e991b8

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 57m 44s
- :heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 22m 46s
- :heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 20m 15s
- :x: openstack-operator-tempest-multinode FAILURE in 1h 42m 24s

softwarefactory-project-zuul[bot] commented 3 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://review.rdoproject.org/zuul/buildset/7aa36ac7e4c4487c9334f98a072cbaaf

- :heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 2h 01m 10s
- :x: podified-multinode-edpm-deployment-crc FAILURE in 1h 39m 23s
- :x: cifmw-crc-podified-edpm-baremetal FAILURE in 1h 33m 43s
- :x: openstack-operator-tempest-multinode FAILURE in 1h 43m 48s

dprince commented 3 months ago

> looks good to me, the only thing I am not sure I understand correctly is how all those changes in the various {service}.go files are related to the PR description :)

In order to enforce the update order, we need to make sure the services have actually been deployed (ready state and observed generation checks).
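
A minimal sketch of that kind of check, written in Python for brevity (the operator itself is Go). `ServiceStatus` and its field names are hypothetical and are not the operator's API; the assumption is only that a service counts as deployed when it is Ready and its observed generation has caught up with the current spec generation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ServiceStatus:
    """Hypothetical snapshot of one service resource (not the operator's real type)."""
    generation: int           # metadata.generation, bumped on every spec change
    observed_generation: int  # status.observedGeneration, last generation reconciled
    ready: bool               # True when the Ready condition is True


def is_deployed(svc: ServiceStatus) -> bool:
    """A service counts as deployed only if it is Ready for the current spec."""
    return svc.ready and svc.observed_generation == svc.generation


def all_deployed(services: Dict[str, ServiceStatus]) -> bool:
    """True when every tracked service has been deployed for its current spec."""
    return all(is_deployed(svc) for svc in services.values())
```

The generation comparison matters because a Ready condition left over from before a spec change would otherwise let the next update step start too early.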

dprince commented 3 months ago

recheck

booxter commented 3 months ago

I haven't tried to run it, but looking at the code, the order of updates looks correct.

I couldn't figure out from just reading the code how we guarantee that only ovn-controller (and not e.g. nova-compute) is updated on DP, before CP reconcileNormal is triggered. Will it be managed using a separate Deployment that would run a single ovn DP service? (I assume updating nova-compute before CP nova services is a problem.) Perhaps someone could ELI5 to me. (Thank you.)

bshephar commented 3 months ago

> I haven't tried to run it, but looking at the code, the order of updates looks correct.
>
> I couldn't figure out from just reading the code how we guarantee that only ovn-controller (and not e.g. nova-compute) is updated on DP, before CP reconcileNormal is triggered. Will it be managed using a separate Deployment that would run a single ovn DP service? (I assume updating nova-compute before CP nova services is a problem.) Perhaps someone could ELI5 to me. (Thank you.)

Also trying to follow this. So we set the condition here: https://github.com/openstack-k8s-operators/openstack-operator/pull/792/files#diff-32500fc60d27debdcd1f64468b83c0a318d79fa55591b2e17a5cb935d9fde650R248

But that just prevents the controller from continuing the update until the condition has been satisfied. It seems that the actual update of OVN on the Dataplane would need to be a manual process as described here:

https://github.com/openstack-k8s-operators/dataplane-operator/blob/main/docs/assemblies/proc_updating-the-data-plane-ovn.adoc

So this would happen first, and then the update would pause until the condition gets satisfied. To satisfy the condition, the deployed image would need to match the image the update is expecting, as determined by: https://github.com/openstack-k8s-operators/openstack-operator/pull/792/files#diff-308714db370a47145837acae5ff60352d9f352513007441b8c694f2c45c1031dR38-R53

Then the update can continue.
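
As a rough illustration of that gate (not the code in the PR), here is a small Python sketch. The function name, the mapping of node set names to images, and the image strings are all hypothetical; it only assumes that the condition is satisfied once every deployed image equals the expected one.

```python
from typing import Mapping


def ovn_dataplane_updated(expected_image: str, deployed: Mapping[str, str]) -> bool:
    """Return True once every node set reports the image the update expects.

    `deployed` maps a node set name to the OVN image it last deployed
    (hypothetical shape). An empty mapping is treated as "not updated yet".
    """
    return bool(deployed) and all(
        image == expected_image for image in deployed.values()
    )
```

With this shape, a single node set still on the old image keeps the condition false, so the control plane update stays paused until the OVN-only deployment has run everywhere.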

At least that's my 10-minute read on what's happening here.

The answer is that the user will create a deployment limited to the OVN service, like:

```yaml
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: edpm-deployment-ipam-update
spec:
  nodeSets:
    - openstack-edpm-ipam
    - <nodeSet_name>
    - ...
    - <nodeSet_name>
  servicesOverride:
    - ovn
```

dprince commented 3 months ago

Dataplane updates are manual for GA. There are some Jira issues filed (https://issues.redhat.com/browse/OSPRH-6421) which might help us fully streamline the minor update workflow.

dprince commented 3 months ago

I do think we could also validate, based on conditions set on the OpenStackVersion resource, that we are in the correct state when Dataplane resources get executed. For example, if we need to execute just an OVN playbook, we could have a crude validation for that. The administrator could always override this with Ansible, but I think a simple check like this could help us further guard the workflow in the future.
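
A crude validation of that kind might look like the following Python sketch. Nothing here exists in the operators today: the function, its parameters, and the error text are hypothetical, and the rule it encodes (only an `ovn`-only `servicesOverride` is accepted while the OVN dataplane update is still pending) is just one possible reading of the idea.

```python
from typing import List, Sequence


def validate_services(
    services_override: Sequence[str], ovn_dataplane_updated: bool
) -> List[str]:
    """Return validation errors; an empty list means the deployment may run.

    `ovn_dataplane_updated` stands in for a condition read from the
    OpenStackVersion resource (hypothetical name).
    """
    errors: List[str] = []
    # While the OVN dataplane update is pending, only an OVN-only run is allowed.
    if not ovn_dataplane_updated and list(services_override) != ["ovn"]:
        errors.append(
            "OVN dataplane update is pending: servicesOverride must be exactly ['ovn']"
        )
    return errors
```

Returning a list of messages rather than raising keeps the check easy to surface as a warning, which fits the point that an administrator can still override it.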

booxter commented 3 months ago

Thank you @bshephar @dprince, this (rolling in a Deployment just for the ovn service) makes sense. It's a bit more legwork for the user, but the main point is that we have a plan.

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprince, stuggi


Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openstack-k8s-operators/openstack-operator/blob/main/OWNERS)~~ [dprince,stuggi]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.