openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

ETCD-535: Manual CA rotation should rotate all leaf certs #1200

Closed: tjungblu closed this 8 months ago

tjungblu commented 9 months ago

/hold

The cool thing is that we're now able to "swap" signers with the existing logic with:

$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -

which effectively overwrites the old signer in openshift-config with the new signer from openshift-etcd. That works because the bundle containing the new signer has already been distributed to all control-plane (CP) nodes. CEO (cluster-etcd-operator) will then proceed to rewrite all leaf certs, which are rolled out together via etcd-all-certs.
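One way to confirm the swap took effect is to compare the signer certificate in both namespaces. A minimal sketch, assuming the certificate is stored under the `tls.crt` key of the secret (typical for TLS secrets, but not stated in this thread); `signer_cert` and `signers_match` are hypothetical helper names:

```shell
# Hypothetical helpers, not part of this PR.
# Assumption: the signer certificate is stored under the tls.crt key.

# Print the base64-encoded signer certificate from the given namespace.
signer_cert() {
  oc get secret etcd-signer -n "$1" -o jsonpath='{.data.tls\.crt}'
}

# Succeed only if both namespaces hold the same, non-empty signer certificate.
signers_match() {
  new=$(signer_cert openshift-etcd) || return 1
  old=$(signer_cert openshift-config) || return 1
  [ -n "$new" ] && [ "$new" = "$old" ]
}
```

After the `oc apply`, `signers_match && echo "signer swapped"` should print the message; between generating the new signer and applying it, the two certificates are expected to differ.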

Manual rotation is then just a two-step process:

Generate a new signer:

$ oc delete secret etcd-signer -n openshift-etcd

... wait for the rollout ...

Replace the old signer with the new signer:

$ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
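The two steps above could be scripted roughly as follows. This is a sketch, not a procedure verified in this PR: the function names are hypothetical, the timeouts are arbitrary, and the wait step assumes the rollout is reflected in the `Progressing` condition of the `etcd` ClusterOperator, which the thread does not specify.

```shell
# Hypothetical helper names; only the oc/jq invocations in steps 1 and 2
# are taken from the description above.
SIGNER=etcd-signer

# Step 1: delete the signer so the operator generates a new one.
generate_new_signer() {
  oc delete secret "$SIGNER" -n openshift-etcd
}

# Assumption: the rollout shows up on the etcd ClusterOperator's
# Progressing condition. The description only says "wait for the rollout".
wait_for_rollout() {
  # Give the operator a chance to start progressing; a timeout here is ignored.
  oc wait clusteroperator/etcd --for=condition=Progressing=True --timeout=5m || true
  # Then block until it reports that it is no longer progressing.
  oc wait clusteroperator/etcd --for=condition=Progressing=False --timeout=60m
}

# Step 2: copy the new signer over the old one in openshift-config.
replace_old_signer() {
  oc get secret "$SIGNER" -n openshift-etcd -ojson |
    jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' |
    oc apply -n openshift-config -f -
}

# Run the steps in order and stop at the first failure.
rotate_etcd_signer() {
  generate_new_signer && wait_for_rollout && replace_old_signer
}
```

Calling `rotate_etcd_signer` runs the whole sequence. Keeping the wait in its own function makes it easy to swap in a different readiness check, or a manual confirmation, if the `Progressing` assumption does not hold on a given cluster.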

openshift-ci-robot commented 9 months ago

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1200):

> /hold
>
> This is expected to increase the CPU usage, given the node update frequency and the amount of cert parsing/validation introduced.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

tjungblu commented 9 months ago

/test e2e-operator

tjungblu commented 9 months ago

/test e2e-operator

tjungblu commented 9 months ago

/test e2e-operator

tjungblu commented 9 months ago

/test e2e-operator

tjungblu commented 9 months ago

/test unit

tjungblu commented 9 months ago

/test e2e-operator
/test unit

tjungblu commented 9 months ago

/test e2e-operator

tjungblu commented 9 months ago

/test e2e-operator
/test unit

tjungblu commented 9 months ago

/test e2e-operator
/test unit

tjungblu commented 9 months ago

/test e2e-operator
/test unit

tjungblu commented 9 months ago

/test e2e-operator

openshift-ci-robot commented 9 months ago

@tjungblu: This pull request references ETCD-535 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1200):

> /hold
>
> This is expected to increase the CPU usage, given the node update frequency and the amount of cert parsing/validation introduced.
>
> ---
>
> Some results so far:
> * the node update along their heartbeat interval, so the controller triggers every couple of seconds -> reducing the informer to only the CP nodes
> * with the CP nodes only, we add about 5% CPU usage to the operator - biggest chunk is still TLS handshakes with etcd (about 30-40% - which is still too high for my taste, given that we cache the clients)
>
> The cool thing is that we're now able to "swap" signers with the existing logic with:
>
> > $ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -
>
> which is effectively overwriting the new signer from `openshift-etcd` into the old signer in `openshift-config`. That works, because the bundle with the new signer is already distributed to all CP nodes. CEO will then proceed to rewrite all leaf certs, which are then rolled out together via `etcd-all-certs`.
>
> Manual rotation is then just two step manual process:
>
> Generate new signer:
> > $ oc delete secret etcd-signer -n openshift-etcd
>
> ... wait for the rollout ...
>
> Replace the old signer with the new signer:
>
> > $ oc get secret etcd-signer -n openshift-etcd -ojson | jq 'del(.metadata["namespace","creationTimestamp","resourceVersion","selfLink","uid"])' | oc apply -n openshift-config -f -

tjungblu commented 8 months ago

/test e2e-operator
/test unit

tjungblu commented 8 months ago

/hold cancel

hasbro17 commented 8 months ago

/lgtm
/retest-required

openshift-ci[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [hasbro17,tjungblu]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 8 months ago

/retest-required

Remaining retests: 0 against base HEAD 479c2c783f3c652daeccd82248b66ff80d252e92 and 2 for PR HEAD f7ab5384a2da7a8c392d64cd711fec08f05274b9 in total

openshift-ci[bot] commented 8 months ago

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-qe-no-capabilities | f7ab5384a2da7a8c392d64cd711fec08f05274b9 | link | false | `/test e2e-gcp-qe-no-capabilities` |
| ci/prow/e2e-aws-etcd-recovery | f7ab5384a2da7a8c392d64cd711fec08f05274b9 | link | false | `/test e2e-aws-etcd-recovery` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 8 months ago

/retest-required

Remaining retests: 0 against base HEAD a2747fcf2db584d582faa7be533dfdcb00134b7b and 1 for PR HEAD f7ab5384a2da7a8c392d64cd711fec08f05274b9 in total

tjungblu commented 8 months ago

/override ci/prow/e2e-operator-fips

unrelated OLM failure

openshift-ci[bot] commented 8 months ago

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-operator-fips

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1200#issuecomment-1997536027):

> /override ci/prow/e2e-operator-fips
>
> unrelated OLM failure

tjungblu commented 8 months ago

I'm going to set this label so that a full week of CI run data doesn't go to waste. Sorry it took so long to retest and eventually get here.

/label acknowledge-critical-fixes-only

tjungblu commented 8 months ago

sigh, another retest for today :yawning_face:

openshift-bot commented 8 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-etcd-operator-container-v4.16.0-202403180813.p0.geeef803.assembly.stream.el9 for distgit cluster-etcd-operator. All builds following this will include this PR.