openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

ETCD-573: add recert cmd #1227

Status: Closed (tjungblu closed this 2 months ago)

tjungblu commented 8 months ago

you can run it with:

cluster-etcd-operator recert -o asset-out --hips master-1=192.168.2.1,master-2=192.168.2.2,master-3=192.168.2.3
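For readers unfamiliar with the `--hips` value, the following is a minimal sketch of how a comma-separated list of `host=ip` pairs like the one above could be parsed. It is not taken from this PR: `parseHostIPs` is a hypothetical helper name, and the validation rules (non-empty host, parseable IP) are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseHostIPs is a hypothetical helper (not the operator's actual code).
// It splits "host=ip,host=ip" into a map and rejects malformed pairs.
func parseHostIPs(s string) (map[string]string, error) {
	out := make(map[string]string)
	for _, pair := range strings.Split(s, ",") {
		host, ip, ok := strings.Cut(pair, "=")
		if !ok || host == "" {
			return nil, fmt.Errorf("expected host=ip, got %q", pair)
		}
		// Assumption: every value must be a valid IPv4 or IPv6 address.
		if net.ParseIP(ip) == nil {
			return nil, fmt.Errorf("invalid IP %q for host %q", ip, host)
		}
		out[host] = ip
	}
	return out, nil
}

func main() {
	m, err := parseHostIPs("master-1=192.168.2.1,master-2=192.168.2.2,master-3=192.168.2.3")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m), m["master-2"])
}
```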

openshift-ci[bot] commented 8 months ago

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-etcd-scaling | a8c8457f457d45b7703f8345e1e34c5831aaf496 | link | true | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-operator-fips | a8c8457f457d45b7703f8345e1e34c5831aaf496 | link | true | /test e2e-operator-fips |
| ci/prow/e2e-gcp-qe-no-capabilities | a8c8457f457d45b7703f8345e1e34c5831aaf496 | link | false | /test e2e-gcp-qe-no-capabilities |


Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
dusk125 commented 8 months ago

/lgtm

openshift-ci[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [dusk125,tjungblu]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

hasbro17 commented 7 months ago

/hold

Haven't forgotten about this but still reviewing.

Not blocking this PR, but I wanted to think ahead about how we'd actually run this cmd automatically once we detect that etcd is down with expired certs. That may affect how we generate them here.

First is the detection of expired certs. I'm thinking this would be a health check or polling probe that can either query etcd locally to see an `x509: certificate has expired or is not yet valid` error, or just inspect the on-disk cert to check its expiry date. If this is a sidecar in the operator, then we may not have sufficient hostpath permissions to do either of those, right? And it can't be in the etcd pod, as we need to run this from a single place.
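As a rough illustration of the "inspect the on-disk cert" option, here is a minimal Go sketch that decodes a PEM certificate and compares its validity window against the current time. The function names `certExpired` and `throwawayCertPEM` are hypothetical, and the sketch generates a throwaway self-signed cert in memory instead of reading a real file, since the actual on-disk path and the permissions question raised above are not settled here. A real probe would presumably load the bytes with `os.ReadFile`.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// certExpired reports whether the first PEM certificate in data is outside
// its validity window right now (expired or not yet valid).
func certExpired(data []byte) (bool, error) {
	block, _ := pem.Decode(data)
	if block == nil || block.Type != "CERTIFICATE" {
		return false, errors.New("no PEM certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	now := time.Now()
	return now.Before(cert.NotBefore) || now.After(cert.NotAfter), nil
}

// throwawayCertPEM is a hypothetical helper for this sketch only: it creates
// a self-signed cert whose NotAfter is hoursValid hours from now.
func throwawayCertPEM(hoursValid int) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example"},
		NotBefore:    now.Add(-48 * time.Hour),
		NotAfter:     now.Add(time.Duration(hoursValid) * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	for _, hours := range []int{-1, 24} {
		data, err := throwawayCertPEM(hours)
		if err != nil {
			panic(err)
		}
		expired, err := certExpired(data)
		if err != nil {
			panic(err)
		}
		fmt.Printf("NotAfter in %dh: expired=%v\n", hours, expired)
	}
}
```

Checking the file directly avoids needing a working etcd connection, but it still requires read access to wherever the certs live on the host.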

Second is the distribution step. Since we're generating everything in one place, I'm guessing we have to scp the output to all the other nodes. That isn't needed for SNO, though.

Lastly, since we're only modifying the on-disk cert files, that doesn't change the secrets and bundle configmaps in etcd that the installer uses for a new revision. So we need to figure out how to update the cert secrets and configmaps in etcd; otherwise the next revision rollout would reuse the expired signer certs in etcd instead of the new ones generated on disk.

Maybe if we relaxed the constraint and assumed that the signers aren't expired when the cluster is offline, then we could regenerate only the peer/server and client certs on disk, distribute them, bring the cluster up, and then rotate the node cert secrets and configmaps.
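To make the relaxed variant concrete, here is a minimal sketch of issuing a fresh per-node cert from a signer that is still valid, using only the Go standard library. It is an illustration under assumptions, not the PR's implementation: `newSigner` and `issueLeaf` are hypothetical names, the key type, lifetimes, and key usages are placeholders, and the signer is generated in memory where a real flow would load the existing signer cert and key.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newSigner creates a throwaway self-signed CA standing in for the
// still-valid signer (hypothetical; a real flow would load it from disk).
func newSigner() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-signer"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// issueLeaf signs a new cert for one node, with the host name and IP as SANs.
func issueLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey, host, ip string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	parsedIP := net.ParseIP(ip)
	if parsedIP == nil {
		return nil, nil, fmt.Errorf("invalid IP %q", ip)
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: host},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(12 * time.Hour), // placeholder lifetime
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		DNSNames:     []string{host},
		IPAddresses:  []net.IP{parsedIP},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	ca, caKey, err := newSigner()
	if err != nil {
		panic(err)
	}
	leaf, _, err := issueLeaf(ca, caKey, "master-1", "192.168.2.1")
	if err != nil {
		panic(err)
	}
	// CheckSignatureFrom confirms the leaf chains to the existing signer.
	fmt.Println(leaf.Subject.CommonName, leaf.IPAddresses, leaf.CheckSignatureFrom(ca) == nil)
}
```

Because the signer is unchanged in this variant, the trust bundles already stored in the cluster would still verify the new leaf certs, which is what makes the later secret and configmap rotation possible once etcd is back up.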

Anyway, not a blocker for this PR but we can discuss and flesh that out a bit as well.

openshift-ci-robot commented 7 months ago

@tjungblu: This pull request references ETCD-573 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

openshift-bot commented 4 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-merge-robot commented 4 months ago

PR needs rebase.

dusk125 commented 4 months ago

/remove-lifecycle stale