tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Tekton incompatible with Karpenter #7500

Open alisonjenkins opened 9 months ago

alisonjenkins commented 9 months ago

Expected Behavior

When build pods request resources that are not immediately available, the pods should be left Pending so that the Kubernetes scheduler can find capacity for them.

Karpenter requires pods to remain Pending so that it can provision capacity when the existing nodes cannot satisfy the requested resources.

Actual Behavior

(screenshot not reproduced: the TaskRun fails immediately when no existing node can fit the pod, instead of waiting for Karpenter to provision one)

Steps to Reproduce the Problem

  1. Set up Karpenter on AWS so that it can provision nodes.
  2. Run a TaskRun whose resource requests will not fit on any existing node (see the sketch after this list).
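
A minimal sketch of a TaskRun that reproduces this, using an embedded taskSpec; the name, image, and resource sizes are illustrative, the point is only that the step requests more than any existing node offers:

```yaml
# Illustrative only: a TaskRun whose step requests more CPU/memory than any
# node currently in the cluster, so its pod cannot be scheduled without
# Karpenter provisioning a new node. With the current behaviour the TaskRun
# fails quickly instead of waiting for that node.
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: big-build
spec:
  taskSpec:
    steps:
      - name: build
        image: golang:1.21
        script: |
          go build ./...
        computeResources:
          requests:
            cpu: "16"        # larger than any existing node
            memory: 64Gi
```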

Additional Info

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.8-eks-8cb36c9", GitCommit:"fca3a8722c88c4dba573a903712a6feaf3c40a51", GitTreeState:"clean", BuildDate:"2023-11-22T21:52:13Z", GoVersion:"go1.20.11", Compiler:"gc", Platform:"linux/amd64"}
Client version: 0.22.0
Pipeline version: v0.54.0
Triggers version: v0.25.3
Dashboard version: v0.42.0

Karpenter version: 0.33.0

This behaviour is preventing us from scaling our build resources to zero. If Tekton allowed Karpenter to schedule the pods, as it does for everything else, Karpenter could dynamically create nodes that fit the requested compute requirements and then scale those nodes back down once the tasks finished.

To work around this problem I have had to set up overprovisioning pods, which means we are paying for that capacity even when nothing is building.
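
For reference, the workaround is the usual low-priority placeholder pattern; a minimal sketch (the PriorityClass, names, and sizes here are illustrative, not necessarily what we run):

```yaml
# Low-priority "pause" pods hold a build-sized chunk of capacity warm.
# Build pods preempt them, and the preempted pause pods then trigger
# Karpenter to provision replacement capacity.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that build pods may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: build-overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: build-overprovisioning
  template:
    metadata:
      labels:
        app: build-overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "8"       # keeps a build-sized node alive
              memory: 32Gi
```

The cost is exactly the problem described above: a build-sized node stays up, and is paid for, even when nothing is building.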

vdemeester commented 9 months ago

👋🏼 @alisonjenkins. Indeed, today tektoncd/pipeline fails quickly in cases such as ExceededNodeResources because, usually, we want to fail early (and we have no guarantee that something like a cluster autoscaler is available in the cluster).

I do see a lot of value in supporting these cases, though. Scaling build resources to zero and using Tekton alongside Karpenter should be possible. I think we would want an opt-in or opt-out configuration that gives the Pod some time to be scheduled in the case of recoverable errors (such as ExceededNodeResources), so that it would work with Karpenter. That would allow Tekton and Karpenter to be used together without either having to know about the other.
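
To make that concrete, such an opt-in might be a new key in the existing config-defaults ConfigMap; the key name and semantics below are purely hypothetical and do not exist in Tekton today:

```yaml
# Hypothetical sketch of the opt-in being discussed; this key does NOT exist
# in Tekton. The idea: instead of failing a TaskRun immediately on a
# recoverable scheduling error (e.g. ExceededNodeResources), wait this long
# for an autoscaler such as Karpenter to provide capacity.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-resource-pressure-grace-minutes: "10"   # hypothetical key name
```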

alisonjenkins commented 9 months ago

> 👋🏼 @alisonjenkins. Indeed, today tektoncd/pipeline fails quickly in cases such as ExceededNodeResources because, usually, we want to fail early (and we have no guarantee that something like a cluster autoscaler is available in the cluster).
>
> I do see a lot of value in supporting these cases, though. Scaling build resources to zero and using Tekton alongside Karpenter should be possible. I think we would want an opt-in or opt-out configuration that gives the Pod some time to be scheduled in the case of recoverable errors (such as ExceededNodeResources), so that it would work with Karpenter. That would allow Tekton and Karpenter to be used together without either having to know about the other.

Indeed, we need some way to toggle the behaviour off so that Karpenter can do its thing in clusters where it is present, preferably on a per-PipelineRun basis: Karpenter may be installed on the cluster but not in use on the nodes the user wishes to build on, in which case the current behaviour is the desired one.

movinfinex commented 1 month ago

> I think we would want an opt-in or opt-out configuration that gives the Pod some time to be scheduled in the case of recoverable errors (such as ExceededNodeResources), so that it would work with Karpenter. That would allow Tekton and Karpenter to be used together without either having to know about the other.

It's also a workload-dependent thing. You might want fail-fast behaviour in some cases, but if you have a large number of low-priority batch jobs to run on a small number of worker nodes, you'd probably prefer them to each wait until a node is ready.

How about controlling the behaviour with an annotation on the pod (propagated from taskruns, etc.)? E.g., if a pod has an annotation like tekton.dev/ifResourcesUnavailable: wait, then the pod is allowed to stay pending (until the usual task/pipeline timeout).
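
A sketch of how that would look on a TaskRun, using the annotation name/value proposed above (not an existing Tekton feature; the TaskRun and Task names are illustrative):

```yaml
# Proposed (not implemented) annotation: ask Tekton to leave the pod Pending
# on recoverable scheduling errors instead of failing fast, until the normal
# TaskRun timeout is reached. The annotation would be propagated from the
# TaskRun to the pod it creates.
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: big-build
  annotations:
    tekton.dev/ifResourcesUnavailable: "wait"   # proposed value from this thread
spec:
  taskRef:
    name: build-task   # illustrative Task name
```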

If we can decide on an annotation name/value, I can contribute a PR.