kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

Handle failure of node with both EIP and CPEM #319

Closed: deitch closed this issue 6 months ago

deitch commented 2 years ago

As described in #304, which was resolved for most failure scenarios, there is one scenario not yet handled.

If the control plane node where CPEM holds the leader lease is also the node the EIP points to (when the EIP is managed via API calls from CPEM, not via BGP), and that node fails, then everything is stuck.

There are several potential solutions; I am looking for more or better thoughts on these.

  1. Deprecate EIP-management for control plane via CPEM.
  2. Keep support but mark this as an unsupported failure mode.
  3. Start a separate container in the same pod in the DaemonSet that manages the EIP. This is very challenging because the containers would need to coordinate, which may mean Raft or another consensus protocol, which we really do not want to implement.
  4. Try to solve the underlying issue: when the EIP becomes unreachable, in-cluster connections to the apiserver fail, even though the other 2 nodes running apiserver are working fine.

The last option has to do with how kubeadm initializes nodes. I believe that if that behavior changed (or kubeadm had an option to change it), it would resolve the problem. I am going to run tests and see if that is the case.
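
For context, a minimal sketch of the mechanism (the endpoint value is a placeholder, not taken from any real cluster): kubeadm bakes the control-plane endpoint into the kubeconfigs it generates, including the kubelet's, so every node ends up reaching the apiserver through the EIP.

```sh
# Sketch only: kubeadm init with the EIP as the control-plane endpoint.
# <EIP> is a placeholder for the elastic IP assigned to the control plane.
kubeadm init --control-plane-endpoint "<EIP>:6443" --upload-certs

# The generated kubelet kubeconfig then points at the EIP rather than the
# node's own apiserver, which is why losing the EIP-holding node breaks
# in-cluster access even while two healthy apiservers remain:
grep 'server:' /etc/kubernetes/kubelet.conf
#   server: https://<EIP>:6443
```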

deitch commented 2 years ago

I completed my tests. I did the following:

  1. Deployed a 3-node cluster using kubeadm
  2. After the cluster was up, changed /etc/kubernetes/kubelet.conf (the kubelet's kubeconfig file) to point to the node's local IP rather than the EIP (see the sketch after this list)
  3. systemctl restart kubelet
  4. removed the EIP entirely
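
Roughly, steps 2-4 amount to the following on each control plane node; the addresses are placeholders, and the exact values depend on the cluster.

```sh
# <EIP> and <NODE_IP> are placeholders for the elastic IP and the node's own address.

# 2. Point the kubelet's kubeconfig at the local apiserver instead of the EIP.
sed -i 's#server: https://<EIP>:6443#server: https://<NODE_IP>:6443#' /etc/kubernetes/kubelet.conf

# 3. Restart the kubelet so it picks up the new server address.
systemctl restart kubelet

# 4. Remove the EIP assignment entirely (done out of band via the Equinix Metal API/console).
```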

I then tried to connect to the service default/kubernetes from various pods. Prior to these changes, this connection failed, which is why leader election failed and hence CPEM could not recover even though it had 3 replicas running.
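
One way to run that check from inside the cluster (a sketch, assuming the default RBAC that allows anonymous access to /healthz):

```sh
# Spin up a throwaway pod and hit the in-cluster apiserver Service.
# Before the kubelet.conf change this failed once the EIP was gone; after it, it returns "ok".
kubectl run conncheck --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sk https://kubernetes.default.svc/healthz
```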

With the above change:

  1. each kubelet continues to function normally even if EIP is gone
  2. each etcd (and the cluster) continues to function normally (as expected)
  3. each apiserver continues to function normally
  4. most importantly: workloads that depend on the service default/kubernetes can still reach it, which means:
  5. CPEM leader election works (see the verification sketch after this list), which means:
  6. the failure case goes away
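
To verify point 5, one can watch the leader-election Lease move to a surviving replica. A sketch, assuming CPEM uses the default cloud-controller-manager Lease in kube-system (the resource name may differ per deployment):

```sh
# Show the current leader, then watch the holder change after the EIP-holding node fails.
kubectl -n kube-system get lease cloud-controller-manager \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
kubectl -n kube-system get lease cloud-controller-manager -w
```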

But kubeadm does not support this configuration for now. The right next step is to follow up with the kubeadm project.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/319#issuecomment-2005590464):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.