kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

Handle failure of node with both EIP and CPEM #319

Closed: deitch closed this issue 6 months ago

deitch commented 2 years ago

As described in #304, which was resolved for most failure scenarios, there is one scenario not yet handled.

If the control plane node where CPEM holds the leader lease is also the node the EIP points to (when the EIP is managed via API calls from CPEM, not via BGP), and that node fails, then everything is stuck.

There are several potential solutions; I am looking for more or better thoughts on these.

  1. Deprecate EIP-management for control plane via CPEM.
  2. Keep support but mark this as an unsupported failure mode.
  3. Start a separate container in the same pod in the DaemonSet that manages the EIP. This is very challenging because the containers would need to coordinate, which may mean Raft or another consensus protocol, which we really do not want to implement.
  4. Try to solve the underlying issue: when the EIP becomes unreachable, in-cluster connections to the apiserver fail, even though the other 2 nodes running apiserver are working fine.

The last option has to do with how kubeadm initializes nodes. I believe that if that behavior changed (or kubeadm had an option to change it), it would resolve the problem. I am going to run tests and see if that is the case.
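
For context, a minimal sketch of the mechanism (the endpoint value is a placeholder, not taken from any real cluster): kubeadm bakes the control-plane endpoint into the kubeconfigs it generates, including the kubelet's, so every node ends up reaching the apiserver through the EIP.

```sh
# Sketch only: kubeadm init with the EIP as the control-plane endpoint.
# <EIP> is a placeholder for the elastic IP assigned to the control plane.
kubeadm init --control-plane-endpoint "<EIP>:6443" --upload-certs

# The generated kubelet kubeconfig then points at the EIP rather than the
# node's own apiserver, which is why losing the EIP-holding node breaks
# in-cluster access even while two healthy apiservers remain:
grep 'server:' /etc/kubernetes/kubelet.conf
#   server: https://<EIP>:6443
```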

deitch commented 2 years ago

I completed my tests. I did the following:

  1. Deployed a 3-node cluster using kubeadm
  2. After the cluster was up, changed /etc/kubernetes/kubelet.conf (the kubelet's kubeconfig file) to point to the node's local IP rather than the EIP (see the sketch after this list)
  3. systemctl restart kubelet
  4. removed the EIP entirely
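
Roughly, steps 2-4 amount to the following on each control plane node; the addresses are placeholders, and the exact values depend on the cluster.

```sh
# <EIP> and <NODE_IP> are placeholders for the elastic IP and the node's own address.

# 2. Point the kubelet's kubeconfig at the local apiserver instead of the EIP.
sed -i 's#server: https://<EIP>:6443#server: https://<NODE_IP>:6443#' /etc/kubernetes/kubelet.conf

# 3. Restart the kubelet so it picks up the new server address.
systemctl restart kubelet

# 4. Remove the EIP assignment entirely (done out of band via the Equinix Metal API/console).
```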

I then tried to connect to the service default/kubernetes from various pods. Prior to these changes, this connection failed, which is why leader election failed and hence CPEM could not recover even though it had 3 replicas running.
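
One way to run that check from inside the cluster (a sketch, assuming the default RBAC that allows anonymous access to /healthz):

```sh
# Spin up a throwaway pod and hit the in-cluster apiserver Service.
# Before the kubelet.conf change this failed once the EIP was gone; after it, it returns "ok".
kubectl run conncheck --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sk https://kubernetes.default.svc/healthz
```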

With the above change:

  1. each kubelet continues to function normally even if EIP is gone
  2. each etcd (and the cluster) continues to function normally (as expected)
  3. each apiserver continues to function normally
  4. most importantly: workloads that depend on the service default/kubernetes can still reach it, which means:
  5. CPEM leader election works (see the verification sketch after this list), which means:
  6. the failure case goes away
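
To verify point 5, one can watch the leader-election Lease move to a surviving replica. A sketch, assuming CPEM uses the default cloud-controller-manager Lease in kube-system (the resource name may differ per deployment):

```sh
# Show the current leader, then watch the holder change after the EIP-holding node fails.
kubectl -n kube-system get lease cloud-controller-manager \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
kubectl -n kube-system get lease cloud-controller-manager -w
```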

But kubeadm does not support this configuration for now. The right next step is to follow up with the kubeadm project.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/319#issuecomment-2005590464):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.