openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

ETCD-512: refactoring the cert signer controller #1177

Closed tjungblu closed 9 months ago

tjungblu commented 10 months ago

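This PR will

* replace the existing cert rotation logic with the more battle-tested logic from library-go
* create new signer certificates (metrics + serving) in the openshift-etcd namespace, in addition to the existing ones in openshift-config
* create new server certificates (peer, serving, serving-metrics)
* create new client certificates (etcd-client, etcd-metrics)
* bundle existing signer certificates with newly created CAs (to stay backward compatible)

The consequence of merging this PR is:

* an additional static pod rollout during installation and upgrades (slightly longer install/upgrade time expected)
* all existing certs are rotated with existing old and new signers, which are distributed to all nodes for actual signer rotation later on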
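
The PR description mentions bundling existing signer certificates with newly created CAs to stay backward compatible. As a rough illustration of why that helps, here is a minimal, standard-library-only sketch: a certificate issued by a new signer fails verification against the old signer alone, but verifies against a bundle containing both. The helpers `newCA`, `newLeaf` and `trustedBy` and the names `old-signer`, `new-signer` and `etcd-serving` are made up for this example; this is not the operator's or library-go's actual code.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a self-signed signer certificate (hypothetical helper).
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newLeaf issues a server certificate signed by the given signer.
func newLeaf(cn string, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether leaf verifies against the given set of signers.
func trustedBy(leaf *x509.Certificate, signers ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, s := range signers {
		pool.AddCert(s)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	})
	return err == nil
}

func main() {
	oldSigner, _, err := newCA("old-signer")
	if err != nil {
		panic(err)
	}
	newSigner, newKey, err := newCA("new-signer")
	if err != nil {
		panic(err)
	}
	leaf, err := newLeaf("etcd-serving", newSigner, newKey)
	if err != nil {
		panic(err)
	}
	fmt.Println("trusted by old signer only:", trustedBy(leaf, oldSigner))
	fmt.Println("trusted by bundle (old+new):", trustedBy(leaf, oldSigner, newSigner))
}
```

Clients that trust the bundle keep accepting certificates from either signer, which is what allows the actual signer rotation to happen later without a hard cut-over.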

tjungblu commented 10 months ago

/hold

tjungblu commented 10 months ago

/retest

tjungblu commented 10 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 10 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c96d870-b457-11ee-991a-701add1801b7-0

tjungblu commented 10 months ago

/payload 4.16 nightly blocking

tjungblu commented 10 months ago

/retest

openshift-ci[bot] commented 10 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/349a8620-b48b-11ee-80d4-c5cc632357a6-0

tjungblu commented 10 months ago

test cluster seemingly went down, trying again

/payload 4.16 nightly blocking

openshift-ci[bot] commented 10 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a0bdbe60-b51e-11ee-8ff5-e60f551db126-0

tjungblu commented 10 months ago

/retest

tjungblu commented 10 months ago

It seems that under some conditions the events for missing resources are emitted more frequently than before. I'm looking into those in more detail.

tjungblu commented 10 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 10 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/47a0e700-b620-11ee-8e3b-cb0f9d6219f6-0

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/cf0d57f0-b904-11ee-9d81-b3ffd2920b91-0

openshift-ci-robot commented 9 months ago

@tjungblu: This pull request references ETCD-512 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1177):

>This PR will
>* replace the existing cert rotation logic with more battle tested ones from library-go
>* create new signer certificates (metrics + serving) in openshift-etcd namespace, in addition to existing ones in openshift-config
>* create new server certificates (peer, serving, serving-metrics)
>* create new client certificates (etcd-client, etcd-metrics)
>* bundle existing signer certificates with newly created CAs (to stay backward compatible)
>
>The consequence of merging this PR is:
>* an additional static pod rollout during installation and upgrades (slightly longer install time expected, upgrades should be unaffected)
>* all existing certs are rotated with existing old and new signers, which are distributed to all nodes for actual signer rotation later on

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/123f2240-bb96-11ee-8540-db573853c215-0

tjungblu commented 9 months ago

I think we also need to increase the flake threshold for the time being: https://github.com/openshift/origin/blob/ec6f7585f45704ccafaaed76772a87d8f96cbcab/pkg/monitortests/etcd/legacyetcdmonitortests/pathological_events.go#L9-L19

The additional static pod rollout (I believe) causes the installer to choke a little more often than before.

edit: https://github.com/openshift/origin/pull/28557

openshift-ci-robot commented 9 months ago

@tjungblu: This pull request references ETCD-512 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1177):

>This PR will
>* replace the existing cert rotation logic with more battle tested ones from library-go
>* create new signer certificates (metrics + serving) in openshift-etcd namespace, in addition to existing ones in openshift-config
>* create new server certificates (peer, serving, serving-metrics)
>* create new client certificates (etcd-client, etcd-metrics)
>* bundle existing signer certificates with newly created CAs (to stay backward compatible)
>
>The consequence of merging this PR is:
>* an additional static pod rollout during installation and upgrades (slightly longer install/upgrade time expected)
>* all existing certs are rotated with existing old and new signers, which are distributed to all nodes for actual signer rotation later on

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

tjungblu commented 9 months ago

/test ?

openshift-ci[bot] commented 9 months ago

@tjungblu: The following commands are available to trigger required jobs:

The following commands are available to trigger optional jobs:

Use /test all to run the following jobs that were automatically triggered:

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1177#issuecomment-1911665775):

>/test ?

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

tjungblu commented 9 months ago

/test configmap-scale
/test e2e-aws
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-gcp
/test e2e-gcp-ovn-etcd-scaling
/test e2e-metal-ipi
/test e2e-metal-ipi-serial-ipv4
/test e2e-metal-single-node-live-iso
/test e2e-vsphere-ovn-etcd-scaling

tjungblu commented 9 months ago

so this seems to be also a cache sync issue on the apiserver-operator:

2024-01-25T17:36:05.791965265Z I0125 17:36:05.791900       1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"c8ca232e-a8f7-4567-919b-66c229939652", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RequiredInstallerResourcesMissing' secrets: aggregator-client,bound-service-account-signing-key,check-endpoints-client-cert-key,control-plane-node-admin-client-cert-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,kubelet-client,localhost-serving-cert-certkey,node-kubeconfigs,service-network-serving-certkey, secrets: etcd-client-10,localhost-recovery-client-token-10,localhost-recovery-serving-certkey-10
2024-01-25T17:36:05.792400417Z E0125 17:36:05.792325       1 base_controller.go:268] InstallerController reconciliation failed: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,check-endpoints-client-cert-key,control-plane-node-admin-client-cert-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,kubelet-client,localhost-serving-cert-certkey,node-kubeconfigs,service-network-serving-certkey, secrets: etcd-client-10,localhost-recovery-client-token-10,localhost-recovery-serving-certkey-10]
2024-01-25T17:36:06.005475338Z I0125 17:36:06.005418       1 reflector.go:351] Caches populated for *v1.Secret from k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229
2024-01-25T17:36:06.015869416Z I0125 17:36:06.015809       1 base_controller.go:73] Caches are synced for CertRotationController 
2024-01-25T17:36:06.015869416Z I0125 17:36:06.015848       1 base_controller.go:110] Starting #1 worker of CertRotationController controller ...
2024-01-25T17:36:06.015921058Z I0125 17:36:06.015876       1 base_controller.go:73] Caches are synced for CertRotationController 

...

Once the caches are synced, the event stops entirely. Could this be a race between the installer controller from library-go and some of the secret informers?
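
To make the suspected race concrete, here is a small standard-library-only simulation. It is a sketch under assumptions, not the library-go installer controller or client-go informer code: `secretCache`, `populate`, `missing` and `reconcile` are hypothetical names. It shows how a check that reads an unsynced cache reports every required secret as missing, while a check that first waits for the sync skips the premature verdict.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// secretCache is a hypothetical stand-in for an informer-backed secret lister.
type secretCache struct {
	mu     sync.RWMutex
	synced bool
	names  map[string]bool
}

func newSecretCache() *secretCache {
	return &secretCache{names: map[string]bool{}}
}

// populate simulates the reflector finishing its initial list of secrets.
func (c *secretCache) populate(names ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range names {
		c.names[n] = true
	}
	c.synced = true
}

// hasSynced reports whether the initial list has completed.
func (c *secretCache) hasSynced() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.synced
}

// missing returns the required secrets that are absent from the cache, sorted.
func (c *secretCache) missing(required []string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []string
	for _, r := range required {
		if !c.names[r] {
			out = append(out, r)
		}
	}
	sort.Strings(out)
	return out
}

// reconcile mimics an installer-style "required resources" check.
// With waitForSync set, it declines to judge until the cache is synced.
func reconcile(c *secretCache, required []string, waitForSync bool) (missing []string, skipped bool) {
	if waitForSync && !c.hasSynced() {
		return nil, true
	}
	return c.missing(required), false
}

func main() {
	required := []string{"etcd-client-10", "kubelet-client", "node-kubeconfigs"}
	cache := newSecretCache()

	// Racy path: the check runs before the cache is populated.
	m, _ := reconcile(cache, required, false)
	fmt.Println("before sync, ungated, missing:", m)

	// Gated path: the check is skipped instead of reporting false positives.
	_, skipped := reconcile(cache, required, true)
	fmt.Println("before sync, gated, skipped:", skipped)

	cache.populate(required...)
	m, _ = reconcile(cache, required, true)
	fmt.Println("after sync, missing:", m)
}
```

In real controllers the equivalent gate is usually a wait on the informers' sync state before the first reconcile; whether that is the right fix here depends on which informers the installer controller actually reads from.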

tjungblu commented 9 months ago

Created https://issues.redhat.com/browse/OCPBUGS-28243 for the apiserver operator. I'll see whether I can fix this race in CEO; otherwise we can simply increase the thresholds with the origin PR.

tjungblu commented 9 months ago

/retest

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08916f80-bc3d-11ee-888a-6fe982462ef8-0

tjungblu commented 9 months ago

/retest

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/85e912e0-bc64-11ee-84c6-0bcedc6654ce-0

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbb632b0-be86-11ee-85c8-372b57d8451a-0

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: An error was encountered. No known errors were detected, please see the full error message for details.

Full error message. could not create PullRequestPayloadQualificationRun: client rate limiter Wait returned an error: context canceled

Please contact an administrator to resolve this issue.

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/411b8590-bf47-11ee-9186-572c7d58a7db-0

tjungblu commented 9 months ago

/retest

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/49e0f280-c011-11ee-8794-fac7d04776b2-0

tjungblu commented 9 months ago

/retest-required

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/85968190-c03f-11ee-8d0c-083ebd0bf674-0

tjungblu commented 9 months ago

/hold cancel

tjungblu commented 9 months ago

/payload 4.16 nightly blocking

triggering another run, last one seems botched again by the app cluster issues

openshift-ci[bot] commented 9 months ago

@tjungblu: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/80eb2980-c061-11ee-855a-9115017afbc4-0

hasbro17 commented 9 months ago

Running another for good measure

/payload 4.16 nightly blocking

openshift-ci[bot] commented 9 months ago

@hasbro17: trigger 8 job(s) of type blocking for the nightly release of OCP 4.16

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/28310470-c0c7-11ee-9ac3-113064ab5974-0

tjungblu commented 9 months ago

one last run for the squash:

/payload 4.16 nightly blocking