openstack-k8s-operators / openstack-operator

Meta Operator for OpenStack
https://openstack-k8s-operators.github.io/openstack-operator/
Apache License 2.0
27 stars 76 forks source link

Requeue result from EnsureTLSCerts even when err != nil #990

Closed slagle closed 1 month ago

slagle commented 1 month ago

The return[1] from EnsureTLSCerts can be a non nil err, and a ctrl.Result with a requeue. That result should be returned to force a requeue, instead of just a ctrl.Result{}.

[1] https://github.com/openstack-k8s-operators/openstack-operator/blob/fa62144840323419fdb74594870f82d77d8b5b78/pkg/dataplane/cert.go#L178-L180

Signed-off-by: James Slagle jslagle@redhat.com

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slagle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openstack-k8s-operators/openstack-operator/blob/main/OWNERS)~~ [slagle] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
softwarefactory-project-zuul[bot] commented 1 month ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/116fbc3c10fd430e8e1c11a1a1339ff4

:x: openstack-k8s-operators-content-provider FAILURE in 8m 43s :warning: podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider :warning: cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider :warning: openstack-operator-tempest-multinode SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

slagle commented 1 month ago

/hold

I'm now actually not sure this is needed, and I got a nil pointer dereference when trying it. It's not obvious how that happened, so I'd hold on this for the time being.

softwarefactory-project-zuul[bot] commented 1 month ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/363f5e745ae24fe786ae80b703e4769b

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 58m 58s :heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 31m 11s :x: cifmw-crc-podified-edpm-baremetal FAILURE in 1h 30m 56s :x: openstack-operator-tempest-multinode FAILURE in 1h 42m 21s

openshift-ci[bot] commented 1 month ago

@slagle: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/openstack-operator-build-deploy-kuttl 1026968006d7283df4209851cdd44453ef48f835 link true /test openstack-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
stuggi commented 1 month ago

when we return requeue we make err==nil: https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/certmanager/certificate.go#L246-L248

the idea was to ignore (and requeue) only the first error, when the certs get created

right, and when the cert is ready, the status condition of the cert cr gets updated, which should result in the owner (one of the openstack-operator controllers) to reconcile

stuggi commented 1 month ago

when we return requeue we make err==nil: https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/certmanager/certificate.go#L246-L248 the idea was to ignore (and requeue) only the first error, when the certs get created

right, and when the cert is ready, the status condition of the cert cr gets updated, which should result in the owner (one of the openstack-operator controllers) to reconcile

Having said that, I think that what is missing https://github.com/openstack-k8s-operators/openstack-operator/pull/991 . the controller is not watching the certs it is creating and therefore not reconciling if there is an update/change

stuggi commented 1 month ago

when we return requeue we make err==nil: https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/certmanager/certificate.go#L246-L248 the idea was to ignore (and requeue) only the first error, when the certs get created

right, and when the cert is ready, the status condition of the cert cr gets updated, which should result in the owner (one of the openstack-operator controllers) to reconcile

Having said that, I think that what is missing #991 . the controller is not watching the certs it is creating and therefore not reconciling if there is an update/change

@vakwetu I am not that familiar with the dataplane controllers. with the PR I submitted the owner of the certs, nodeset, will reconcile. But shouldn't the deployment also reconcile as it is doing more follow up steps when calling EnsureTLSCerts https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/cert.go#L183-L216 ? or should the dataplanedeployment controller watch the nodesets? or always requeue when there is an error with GetTLSNodeCert at https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/cert.go#L180

slagle commented 1 month ago

when we return requeue we make err==nil: https://github.com/openstack-k8s-operators/lib-common/blob/main/modules/certmanager/certificate.go#L246-L248 the idea was to ignore (and requeue) only the first error, when the certs get created

right, and when the cert is ready, the status condition of the cert cr gets updated, which should result in the owner (one of the openstack-operator controllers) to reconcile

Ok, that makes sense then. A ctrl.Result with RequeueAfter will only be returned from the lib-common module when err=nil. So this PR shouldn't be needed.

slagle commented 1 month ago

closing per discussion in comments.