ukaea / piezo

Create taints and tolerations restricting driver pods to only run on a singular node #137

Open oliver-tarrant-tessella opened 5 years ago

oliver-tarrant-tessella commented 5 years ago

Apply Kubernetes-style taints and tolerations so that the web app and driver pods are restricted in which nodes they can run on. This should ensure that there is always room for executor pods to run, and should resolve issues with the system going down under high workloads.

Acceptance criteria

oliver-tarrant-tessella commented 5 years ago

At high usage the Kubernetes cluster is liable to fill up with driver pods before it has had time to run executor pods. If the cluster runs out of resources for the executor pods, they will hang in Pending and no jobs will run. To avoid this, we need to reserve part of the cluster for executor pods only. Jobs will then always be able to run, and the cluster should never hang without being able to run scheduled jobs.

Taint

The first thing to do to implement this would be to taint a selection of the Kubernetes nodes. This prevents pods from running on these nodes unless the pods have a toleration specified for the taint. This can be performed by running the following:

kubectl taint nodes {name of node to taint} piezoRestriction=executors:NoSchedule

With this applied the nodes will only schedule pods that include the following toleration in their manifest:

tolerations:
- key: "piezoRestriction"
  operator: "Equal"
  value: "executors"
  effect: "NoSchedule"

For more information see here
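The matching rule between a taint and a toleration can be sketched in a few lines of Python. This is only an illustration of the rule as documented for Kubernetes, not the scheduler's actual code; the function name `tolerates` and the dict layout are assumptions made for this example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (illustrative only)."""
    # An empty or missing effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False

    # Kubernetes defaults the operator to "Equal" when it is omitted.
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")

    if operator == "Exists":
        # With "Exists" no value is compared; an empty key matches any key.
        return not key or key == taint.get("key")

    # "Equal": both key and value must match.
    return (key == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


# The taint applied by the kubectl command above.
taint = {"key": "piezoRestriction", "value": "executors", "effect": "NoSchedule"}

# The toleration from the manifest snippet above.
executor_toleration = {
    "key": "piezoRestriction",
    "operator": "Equal",
    "value": "executors",
    "effect": "NoSchedule",
}

# A hypothetical toleration with a different value, for comparison.
other_toleration = dict(executor_toleration, value="drivers")

print(tolerates(executor_toleration, taint))  # True
print(tolerates(other_toleration, taint))     # False
```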

Tolerate the executor pods

To apply the above toleration to the executor pods the following should be included in the manifest created by the manifest populator:

"spec": {
  "executor": {
    "tolerations": [
      {
        "key": "piezoRestriction",
        "operator": "Equal",
        "value": "executors",
        "effect": "NoSchedule"
      }
    ]
  }
}

For more information see here
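As a rough sketch of how the manifest populator might add this, assuming the manifest is built as a plain Python dict before being submitted: the function name `add_executor_toleration` and the example manifest fields (`cores`, `instances`) are hypothetical and are not taken from the piezo code base.

```python
import copy
import json

# The toleration from the manifest fragment above.
EXECUTOR_TOLERATION = {
    "key": "piezoRestriction",
    "operator": "Equal",
    "value": "executors",
    "effect": "NoSchedule",
}


def add_executor_toleration(manifest: dict) -> dict:
    """Return a copy of the manifest with the executor toleration added.

    Hypothetical helper: the real manifest populator may be structured
    differently.
    """
    updated = copy.deepcopy(manifest)
    executor = updated.setdefault("spec", {}).setdefault("executor", {})
    tolerations = executor.setdefault("tolerations", [])
    # Avoid adding the same toleration twice if the function is re-run.
    if EXECUTOR_TOLERATION not in tolerations:
        tolerations.append(dict(EXECUTOR_TOLERATION))
    return updated


# Illustrative manifest; the field values are placeholders.
manifest = {"spec": {"driver": {"cores": 1}, "executor": {"instances": 2}}}
populated = add_executor_toleration(manifest)
print(json.dumps(populated["spec"]["executor"], indent=2))
```

Only the executor section is modified, so driver pods remain without the toleration and are kept off the tainted nodes.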

Note that applying the tolerations to the executor pods won't force them to run on the tainted nodes; they will still run on other nodes if resources are available there. A toleration only permits scheduling onto a tainted node. Therefore, if the taints aren't in place the system will still function normally, but once the taints are applied these nodes will be reserved for the executor pods and any other pods that are given the toleration.
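The effect on scheduling can be illustrated with a small Python sketch. It is a simplification that assumes only `Equal`-operator tolerations and `NoSchedule` taints, and ignores resources entirely; the node names `node-a` and `node-b` and the helper names are hypothetical.

```python
def can_schedule(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits a node only if every NoSchedule taint on it is tolerated."""
    for taint in node_taints:
        if taint["effect"] != "NoSchedule":
            continue
        tolerated = any(
            t["key"] == taint["key"]
            and t["value"] == taint["value"]
            and t["effect"] == taint["effect"]
            for t in pod_tolerations
        )
        if not tolerated:
            return False
    return True


def eligible_nodes(pod_tolerations: list, nodes: dict) -> list:
    """Names of the nodes the pod could be scheduled on."""
    return [name for name, taints in nodes.items()
            if can_schedule(pod_tolerations, taints)]


executor_only = {"key": "piezoRestriction", "value": "executors",
                 "effect": "NoSchedule"}

# Hypothetical two-node cluster: node-b is tainted, node-a is not.
nodes = {"node-a": [], "node-b": [executor_only]}

executor_tolerations = [dict(executor_only)]  # executors carry the toleration
driver_tolerations = []                       # drivers do not

print(eligible_nodes(executor_tolerations, nodes))  # ['node-a', 'node-b']
print(eligible_nodes(driver_tolerations, nodes))    # ['node-a']
```

Executors remain eligible for both nodes, while drivers are limited to the untainted one, which is the reservation behaviour described above.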

oliver-tarrant-tessella commented 5 years ago

Work on this has started and remains in a pull request. The code has not been tested, and the system tests must run successfully before the branch can be merged. See the status of the pull request here: https://github.com/ukaea/piezo/pull/143