kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

During KCP rollout, etcd churn can prevent renewal of the bootstrap token causing KCP node registration to fail #5477

Closed: randomvariable closed this 6 months ago

randomvariable commented 3 years ago

/kind bug

Follow up from #2770

What steps did you take and what happened:

Deploy a 3-node control plane cluster using CAPV with kube-vip on PhotonOS (for some reason, the issue is more likely to occur there), then set kcp.spec.upgradeAfter to trigger a rollout.
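For reference, triggering the rollout looks roughly like the following; the KubeadmControlPlane name and namespace are placeholders for your own objects (in newer API versions the field is named rolloutAfter):

```sh
# Trigger a KCP rollout by setting spec.upgradeAfter to "now".
# "my-cluster-control-plane" and "default" are placeholder names.
kubectl patch kubeadmcontrolplane my-cluster-control-plane \
  --namespace default \
  --type merge \
  --patch "{\"spec\":{\"upgradeAfter\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}}"
```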

Due to a combination of https://github.com/kube-vip/kube-vip/issues/214 and the fact that etcd learner mode is not yet implemented in kubeadm (nor fully supported in Kubernetes), the etcd leader switches many times, and kube-vip leader election also rotates. During this time, CAPI controllers are unable to fully reconcile, and the kubelet cannot register nodes. Importantly, CABPK is also unable to renew the bootstrap token. etcd replication eventually completes, but only after the 15-minute bootstrap token TTL has expired. kubelet node registration ultimately fails, and we end up with an orphaned control plane machine that is still a valid member of the etcd cluster, complicating cleanup.

What did you expect to happen:

Rollout succeeds.

Anything else you would like to add:

In these instances it might be worth having a configurable bootstrap token TTL. A longer default TTL can't account for every type of environment, so it should be configurable, letting vendors and cluster operators decide, since there is a security trade-off.

Furthermore, VMware would like to request a backport to v0.3.x branch.

Environment:

  • Cluster API version: v0.3.22
  • Kubernetes version (use kubectl version): v1.21
  • OS (e.g. from /etc/os-release): Photon 3

sbueringer commented 3 years ago

+1 for a configurable timeout

neolit123 commented 3 years ago

It seems to me that too many things are happening at once.

Would it make sense for upgradeAfter to trigger a more step-by-step operation, with bootstrap token renewal being the first task, then the rest?


fabriziopandini commented 3 years ago

FYI: the CABPK controller manager has a --bootstrap-token-ttl flag.
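For anyone landing here, a minimal sketch of raising the TTL, assuming the default deployment and namespace names from a clusterctl install; the 30m value is illustrative (default is 15m):

```sh
# Append --bootstrap-token-ttl to the CABPK controller manager args.
# A longer-lived token is a larger attack window, so pick the TTL deliberately.
kubectl -n capi-kubeadm-bootstrap-system patch deployment \
  capi-kubeadm-bootstrap-controller-manager --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--bootstrap-token-ttl=30m"}]'
```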

neolit123 commented 3 years ago

> ...the fact that etcd learner mode is not yet implemented in kubeadm (nor fully supported in Kubernetes), the etcd leader switches many times

Hm, would learner mode really solve this? I wonder if this is more of a problem of pinning the leader programmatically. That is something Kubernetes wishes to add in the future; there is a KEP about it.

fabriziopandini commented 3 years ago

Also, AFAIK KCP moves leadership only if the current leader is being deleted, so we are already very conservative about this operation, and I don't see a direct connection between the current process and the etcd leader switching many times.

randomvariable commented 3 years ago

> Would it make sense for upgradeAfter to trigger a more step-by-step operation, with bootstrap token renewal being the first task

We generate a token per machine when its Machine resource is created, so KCP isn't in the loop at that stage; only CABPK is.

> FYI: the CABPK controller manager has a --bootstrap-token-ttl flag.

I think this means we won't need the backport, and I've filed https://github.com/vmware-tanzu/tanzu-framework/issues/954

> Hm, would learner mode really solve this? I wonder if this is more of a problem of pinning the leader programmatically. That is something Kubernetes wishes to add in the future; there is a KEP about it.

Any pointers on that would be handy.

randomvariable commented 3 years ago

> Also, AFAIK KCP moves leadership only if the current leader is being deleted, so we are already very conservative about this operation, and I don't see a direct connection between the current process and the etcd leader switching many times.

There's an interaction between the etcd leader moving and kube-vip, which uses Kubernetes resource locking for its own leader election; that makes things especially problematic.
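You can see that interaction by watching the holder of kube-vip's leader election lock during a rollout. A rough sketch, assuming a recent kube-vip that uses a Lease named plndr-cp-lock in kube-system (older versions used a ConfigMap lock):

```sh
# Print the current kube-vip leader every 2 seconds; rapid holder changes
# during a KCP rollout point at the churn described above.
while true; do
  kubectl -n kube-system get lease plndr-cp-lock \
    -o jsonpath='{.spec.holderIdentity}{"\n"}'
  sleep 2
done
```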

randomvariable commented 3 years ago

Interestingly, @voor reports similar behaviour on AWS when cross-zone load balancing is enabled.

neolit123 commented 3 years ago

> Any pointers on that would be handy.

Here is the KEP issue (still WIP), although this is for Kubernetes leader election: https://github.com/kubernetes/enhancements/issues/2835

vincepri commented 3 years ago

/priority awaiting-more-evidence
/assign @randomvariable
/milestone Next

randomvariable commented 3 years ago

Been speaking to @gab-satchi about looking into some of these thorny etcd thingies.
/assign @gab-satchi

randomvariable commented 3 years ago

/area control-plane

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

voor commented 2 years ago

/remove-lifecycle stale

vincepri commented 2 years ago

/lifecycle frozen

fabriziopandini commented 2 years ago

/triage accepted
/unassign @randomvariable
/unassign @gab-satchi

This should be re-assessed now that we have improved how token renewal is managed (tokens are now renewed until the machine joins); also, in kubeadm we recently discussed reviving the work on etcd learner mode.

fabriziopandini commented 2 years ago

/help

k8s-ci-robot commented 2 years ago

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/5477):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sbueringer commented 2 years ago

Quick update: kubeadm learner mode RFE: https://github.com/kubernetes/kubeadm/issues/1793

chrischdi commented 2 years ago

Also related: I had to increase the timeout here for the self-hosted upgrade tests. From my analysis, the cause is also etcd churn and its side effects (e.g. leader election failing, which makes webhooks unavailable).

fabriziopandini commented 2 years ago

We should better sequence what's happening here. KCP moves leadership to another node only before deleting the current leader, so we are already minimizing the number of forced leader changes and doing what we can for them to happen in a controlled way. Everything else is raft's periodic leader election, which should be pretty stable, assuming the current leader answers timely. 🤔
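For anyone trying to sequence it, the leader flips are visible from a control-plane node. A sketch assuming stacked etcd with kubeadm's default certificate paths; move-leader is what an explicit, controlled transfer (the kind KCP performs before deleting the leader) looks like:

```sh
# Show which member currently leads (IS LEADER column); re-run during a
# rollout to observe churn. Cert paths are kubeadm defaults for stacked etcd.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --cluster -w table

# Explicit, controlled leadership transfer; must be run against the current
# leader's endpoint. The member ID is a placeholder taken from the table above.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  move-leader 8211f1d0f64f3269
```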

sbueringer commented 1 year ago

Just fyi regarding etcd learner mode:

The alpha-level code is merged for v1.27 in https://github.com/kubernetes/kubernetes/pull/113318 (see https://github.com/kubernetes/kubeadm/issues/1793#issuecomment-1377722007).
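For completeness, opting into the alpha behaviour on a plain kubeadm v1.27+ cluster looks roughly like this (a sketch; CAPI does not wire this up yet):

```yaml
# Join new etcd members as learners first, promoting them only once they
# have caught up, which should reduce leader churn while scaling up.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
featureGates:
  EtcdLearnerMode: true
```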

k8s-triage-robot commented 9 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini commented 7 months ago

/priority important-longterm

fabriziopandini commented 6 months ago

We now have a flag in CABPK for setting the TTL, and we made a lot of improvements over the last two years. Also, learner mode in kubeadm is making progress, even if it is not yet used by CAPI.

/close

k8s-ci-robot commented 6 months ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/5477#issuecomment-2072205993):

> We now have a flag in CABPK for setting the TTL, and we made a lot of improvements over the last two years. Also, learner mode in kubeadm is making progress, even if it is not yet used by CAPI.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.