Make the Node Controller optional

displague commented 3 years ago

In some settings, the LoadBalancer helper functionality of CPEM is desired while the Node labeling is not.

Introduce a flag to disable the Node controller.
The Service controller must be allowed to function independently of any labels, node spec values, or annotations that the Node controller would produce.
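
A minimal sketch of what such a flag could look like. The Kubernetes cloud-provider interface lets a provider report, through a boolean second return value, whether it supports a given feature, and the sketch models that pattern with local stand-in types rather than the real `k8s.io/cloud-provider` package. The flag name `disable-node-controller` and every type and function name here are hypothetical, not existing CPEM options.

```go
package main

import (
	"flag"
	"fmt"
)

// Instances and LoadBalancer are local stand-ins for the interfaces a cloud
// provider exposes; they are not the real k8s.io/cloud-provider types.
type Instances interface{ Name() string }
type LoadBalancer interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

// provider is a hypothetical cloud provider whose node handling can be
// switched off.
type provider struct {
	disableNodeController bool
}

// Instances reports false when node handling is disabled, so a manager that
// honors the second return value never starts a node controller.
func (p *provider) Instances() (Instances, bool) {
	if p.disableNodeController {
		return nil, false
	}
	return named("instances"), true
}

// LoadBalancer is always available and does not consult any node state.
func (p *provider) LoadBalancer() (LoadBalancer, bool) {
	return named("loadbalancer"), true
}

// newProvider parses the hypothetical --disable-node-controller flag.
func newProvider(args []string) (*provider, error) {
	fs := flag.NewFlagSet("cpem-sketch", flag.ContinueOnError)
	disable := fs.Bool("disable-node-controller", false, "do not run the node controller")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return &provider{disableNodeController: *disable}, nil
}

func main() {
	p, err := newProvider([]string{"--disable-node-controller"})
	if err != nil {
		panic(err)
	}
	_, nodes := p.Instances()
	_, lb := p.LoadBalancer()
	fmt.Printf("node controller: %v, service controller: %v\n", nodes, lb)
}
```

The point of the sketch is the second requirement above: `LoadBalancer()` never reads anything that `Instances()` would have produced, so turning one off cannot break the other.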

Originally from https://github.com/equinix/terraform-metal-anthos-on-baremetal/issues/56#issuecomment-800264482

displague commented 3 years ago

@deitch - we've been talking about node annotations to propagate information needed for the LoadBalancer (BGP settings), and this issue might be affected by that. Presumably we have existing dependencies that this issue would have to work out.

If the Service controller depends on the Node controller, we can't offer a toggle to turn off the node controller (or stop setting the providerID, specifically).

deitch commented 3 years ago

I don't understand this. Why do we want to disable the node controller, and node labeling? The idea of the node controller being distinct from the services one is something we constructed internally, but they do go together.

What is the need?

displague commented 3 years ago

@deitch In Anthos, a baremetal CCM node controller wants to take this responsibility, but it can't if the node providerID has already been set by another CCM node controller.

https://github.com/equinix/terraform-metal-anthos-on-baremetal/issues/56#issuecomment-800264482 discusses the need. This idea was based on your suggestion :-)

deitch commented 3 years ago

Oh yes, now I remember. Just because I suggested it doesn't mean I would have any memory of it.

I didn't love the approach, but it seemed the only possible one (short of Anthos actually working nicely with official cloud provider CCMs).

What precisely would we want CCM to do and not to do?

displague commented 3 years ago

> What precisely would we want CCM to do and not to do?

That's the question. If we made the node controller optional (a simple flag that keeps the node controller out of the manager), what would break? I think those are the issues we need to solve for.

We may have to assume that with the node controller functionality intentionally disabled, conventional node annotations (https://kubernetes.io/docs/reference/labels-annotations-taints/#nodekubernetesioinstance-type) could be the responsibility of another CCM. Perhaps in this case, EM uses an alternative label/annotation name to identify the instance-type, topology, etc.
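
To make the alternative-name idea concrete, here is a rough sketch. The two well-known keys are the standard Kubernetes ones; the `metal.example.com/` prefix, the function name, and the sample plan and metro values are made up for illustration and are not labels CPEM sets today.

```go
package main

import "fmt"

// Well-known Kubernetes label keys that a cloud provider's node controller
// normally sets.
const (
	wellKnownInstanceType = "node.kubernetes.io/instance-type"
	wellKnownRegion       = "topology.kubernetes.io/region"
)

// altPrefix is a hypothetical provider-specific prefix invented for this
// sketch.
const altPrefix = "metal.example.com/"

// nodeLabels returns the labels to publish for a node. When this CCM owns the
// well-known keys it uses them; when another CCM owns them, the same values
// are published under the alternative prefix instead.
func nodeLabels(plan, metro string, ownsWellKnown bool) map[string]string {
	if ownsWellKnown {
		return map[string]string{
			wellKnownInstanceType: plan,
			wellKnownRegion:       metro,
		}
	}
	return map[string]string{
		altPrefix + "instance-type": plan,
		altPrefix + "region":        metro,
	}
}

func main() {
	// Sample values only; another CCM is assumed to own the well-known keys.
	for k, v := range nodeLabels("c3.small.x86", "da", false) {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```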

We probably shouldn't guess about what a competing CCM wants to manage. We may need more information here.

One of the challenges would be in providing BGP information to the cluster.

Referring to the CCM list of responsibilities:

* Node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
* Service controller - responsible for loadbalancers on your cloud against services of type LoadBalancer.
* Route controller - responsible for setting up network routes on your cloud
* any other features you would like to implement if you are running an out-of-tree provider.

I wonder if there is lexical wiggle room to migrate node annotation functionality to an "any other features" controller, or perhaps use the "route controller" to serve this purpose (I don't know what facilities this controller is expected to offer or whether this would be a good fit).

Perhaps Metadata controller - responsible for updating kubernetes nodes (secrets, and/or configmaps) with BGP and IP configuration and secrets discovered through Equinix Metal metadata.

The metadata service is not available (for now) without public addresses, so this may not be a great solution. We can't even assume 1 node in the cluster would have public addresses. Then again, the EM API functionality in this CCM is broken without public addresses, so layer2 only is not a supported workflow.

Perhaps BGP controller - responsible for updating kubernetes nodes (annotations, and secrets, and/or configmaps) with BGP configuration and secrets discovered through the Equinix Metal API.
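
As a sketch of what the annotation half of such a BGP controller might compute: the `bgp.example.com/` prefix, the key layout, the type names, and the sample ASN and IP values are assumptions for illustration, not the annotations CPEM actually writes. Secrets such as a BGP password are deliberately left out, since those would belong in a Secret rather than an annotation.

```go
package main

import (
	"fmt"
	"strconv"
)

// bgpPeer holds the per-node BGP session details that would be discovered
// through the Equinix Metal API.
type bgpPeer struct {
	NodeASN int
	PeerASN int
	PeerIP  string
}

// annotationPrefix is hypothetical and exists only for this sketch.
const annotationPrefix = "bgp.example.com/"

// bgpAnnotations flattens a node's peers into annotation key/value pairs,
// one group of keys per peer index.
func bgpAnnotations(peers []bgpPeer) map[string]string {
	out := make(map[string]string, len(peers)*3)
	for i, p := range peers {
		base := annotationPrefix + "peer-" + strconv.Itoa(i) + "-"
		out[base+"node-asn"] = strconv.Itoa(p.NodeASN)
		out[base+"peer-asn"] = strconv.Itoa(p.PeerASN)
		out[base+"peer-ip"] = p.PeerIP
	}
	return out
}

func main() {
	// Sample values only.
	peers := []bgpPeer{{NodeASN: 65000, PeerASN: 65530, PeerIP: "169.254.255.1"}}
	for k, v := range bgpAnnotations(peers) {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

A Service controller could then read these annotations (or the equivalent ConfigMap) without caring which controller, or which CCM, wrote them.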

displague commented 3 years ago

Maybe we can wait this problem out :-) https://github.com/equinix/terraform-metal-anthos-on-baremetal/issues/54#issuecomment-821816369

(I think we should still figure this out, in the meantime, since it may be related to the BGP annotations and will help us keep SoC and independence in our controllers)

deitch commented 3 years ago

This also came up in the CAPP/CPEM upgrade discussion (cc @detiber). There are two distinct conversations going on here:

* enabling some flexibility around what the provider ID should be. I instinctively dislike this, but I recognize that it isn't all that hard to do, and doesn't go against the internal design of CP. It simply moves it from hard-coding to default+options
* re-architecting the various controllers, what each one does, what is required vs optional, etc.
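
For the first of these, "default+options" could be as small as the sketch below. The `equinixmetal://` default is assumed here, and the function name and the idea of overriding only the scheme prefix are hypothetical rather than an existing CPEM option.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultProviderIDPrefix is the assumed built-in default for this sketch.
const defaultProviderIDPrefix = "equinixmetal://"

// providerID builds a node providerID from an optional prefix override and a
// device ID. An empty prefix falls back to the default.
func providerID(prefix, deviceID string) (string, error) {
	if prefix == "" {
		prefix = defaultProviderIDPrefix
	}
	if !strings.HasSuffix(prefix, "://") {
		return "", fmt.Errorf("provider ID prefix %q must end in \"://\"", prefix)
	}
	if deviceID == "" {
		return "", fmt.Errorf("device ID must not be empty")
	}
	return prefix + deviceID, nil
}

func main() {
	id, err := providerID("", "0a1b2c3d")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```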

Anthos, IIRC, has had a bit of a hard time with this, and actually ended up copying and modifying their own versions of each CSP's CCM. That is not a route I would want to go down; I would sooner work with whichever SIG managed cloud-provider and see if we can standardize these capabilities.

We should be open to being more flexible than the official CP standards, as long as we don't actually go against them.

If we can come up with a better design, all for it.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/156#issuecomment-2005506212).

kubernetes-sigs / cloud-provider-equinix-metal
