openshift / hypershift

Hyperscale OpenShift - clusters with hosted control planes
https://hypershift-docs.netlify.app
Apache License 2.0

Proposal: Enable control plane node exclusivity through built-in well-known tolerations #371

Closed · ironcladlou closed this issue 3 years ago

ironcladlou commented 3 years ago

We anticipate the need for users to dedicate nodes to control planes for various reasons (e.g. https://github.com/openshift/hypershift/issues/237 regarding noisy neighbors, resource requirements, etc.). One way we expect users to achieve node exclusivity is through the use of taints and tolerations.

This proposal is to bake default tolerations with well-known, predictable values into each control plane component, which users can leverage by adding corresponding taints to their nodes to achieve node exclusivity for any set of hosted control planes.

Each control plane component would have a default toleration set like:

tolerations:
- key: "hypershift.openshift.io/control-plane"
  operator: "Equal"
  value: "true"
  effect: NoSchedule
- key: "hypershift.openshift.io/$cluster-id"
  operator: "Equal"
  value: "true"
  effect: NoSchedule

The idea is that the end user could taint nodes in such a way that the nodes are dedicated to hosted clusters generally, or to any subset (one or many) of hosted clusters based on their ID.
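For illustration, here is a sketch (not part of the proposal itself) of what the corresponding taints on a dedicated node might look like; the node name is made up, and $cluster-id is still the placeholder from above. An admin would typically apply these with kubectl taint nodes rather than editing the Node object directly.

apiVersion: v1
kind: Node
metadata:
  name: dedicated-node-1               # hypothetical node name
spec:
  taints:
  # Reserve the node for hosted control planes in general
  - key: "hypershift.openshift.io/control-plane"
    value: "true"
    effect: NoSchedule
  # Further reserve the node for one specific hosted cluster;
  # $cluster-id is the placeholder whose concrete value is a TODO below
  - key: "hypershift.openshift.io/$cluster-id"
    value: "true"
    effect: NoSchedule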

TODO: Decide the right value to use for the unique cluster ID (e.g. hostedcluster UUID, namespace+name tuple, etc.).

TODO: Define the exhaustive list of control plane components to which these defaults will be applied.

sjenning commented 3 years ago

This is always something I feel the need to remember whenever the topic of taints/tolerations comes up.

A pod tolerating a taint means the pod can be scheduled to the tainted node, not that it will be. In order to confine the scheduling to a set of nodes, nodeSelector and Node labels must be used. The taint just prevents pods with no nodeSelector from scheduling onto the dedicated nodes.
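
As a rough sketch of the confinement half (the label key, node name, and pod below are hypothetical illustrations, not part of the proposal), the dedicated nodes would carry a label and the control plane pods a matching nodeSelector in addition to the toleration:

apiVersion: v1
kind: Node
metadata:
  name: dedicated-node-1
  labels:
    hypershift.openshift.io/control-plane: "true"   # hypothetical well-known label
---
apiVersion: v1
kind: Pod
metadata:
  name: example-control-plane-pod
spec:
  # Confinement: only nodes carrying the label are scheduling candidates
  nodeSelector:
    hypershift.openshift.io/control-plane: "true"
  # Exclusion: the toleration merely allows the pod onto the tainted node
  tolerations:
  - key: "hypershift.openshift.io/control-plane"
    operator: "Equal"
    value: "true"
    effect: NoSchedule
  containers:
  - name: example
    image: registry.k8s.io/pause:3.9                # placeholder image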

ironcladlou commented 3 years ago

> This is always something I feel the need to remember whenever the topic of taints/tolerations comes up.
>
> A pod tolerating a taint means the pod can be scheduled to the tainted node, not that it will be. In order to confine the scheduling to a set of nodes, nodeSelector and Node labels must be used. The taint just prevents pods with no nodeSelector from scheduling onto the dedicated nodes.

Great points. I guess the implicit assumption in the requirements as they've been discussed is that there's no explicit confinement via node selectors in this use case, only exclusionary methods.

However, we also have https://github.com/openshift/hypershift/issues/370, which will induce node colocation by default. Does that change the calculus here? Otherwise, what do you propose? Another thought: we could use a soft node affinity for nodes labelled specifically for the cluster.

Seems like we should be able to describe the expectations in terms of test scenario data (hostedcluster, node state, etc.) and see whether we can satisfy them...

cc @relyt0925 @csrwng

ironcladlou commented 3 years ago

So what I mean here is: if we had both a default hosted-cluster-scoped toleration and a soft node affinity rule, these would be benign out of the box. But if the admin wants to make a node exclusive to the hosted cluster, they could label the node with the well-known label matching the default affinity rule (to get the pods in the right place) and taint the node with the well-known key for that cluster to prevent other clusters' pods from being scheduled onto it.
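
A minimal sketch of such a default soft affinity rule, assuming a hypothetical well-known node label that mirrors the per-cluster taint key (the weight and label key are illustrative):

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: "hypershift.openshift.io/$cluster-id"   # same placeholder/TODO as the taint key
          operator: In
          values:
          - "true"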

relyt0925 commented 3 years ago

I am fine with adding a soft node affinity rule as well to prefer the isolated nodes by default. I think it's a good addition.

relyt0925 commented 3 years ago

also fine with names.

relyt0925 commented 3 years ago

Every component that supports a master should carry these tolerations:

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
catalog-operator                 0/1     0            0           52m
certified-operators-catalog      1/1     1            1           52m
cloud-controller-manager         3/3     3            3           50m
cluster-api                      1/1     1            1           52m
cluster-autoscaler               1/1     1            1           52m
cluster-policy-controller        3/3     3            3           52m
cluster-version-operator         1/1     1            1           52m
community-operators-catalog      1/1     1            1           52m
control-plane-operator           1/1     1            1           52m
hosted-cluster-config-operator   1/1     1            1           51m
ignition-server                  1/1     1            1           52m
konnectivity-agent               1/1     1            1           52m
konnectivity-server              1/1     1            1           52m
kube-apiserver                   3/3     3            3           52m
kube-controller-manager          3/3     3            3           52m
kube-scheduler                   3/3     3            3           52m
oauth-openshift                  3/3     3            3           52m
olm-operator                     1/1     1            1           51m
openshift-apiserver              3/3     3            3           52m
openshift-controller-manager     3/3     3            3           52m
openshift-oauth-apiserver        3/3     3            3           52m
packageserver                    2/2     2            2           52m
redhat-marketplace-catalog       1/1     1            1           52m
redhat-operators-catalog         1/1     1            1           52m

In addition, the manifest-bootstrapper pod and any one-time creation pods (machine config server, image lookup if it doesn't go away) should tolerate them as well.

relyt0925 commented 3 years ago

/assign @csrwng

ironcladlou commented 3 years ago

Tracked in https://issues.redhat.com/browse/HOSTEDCP-193