grid-dev opened this issue 2 years ago
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/remove-lifecycle stale as the issue still persists
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle rotten
Send feedback to tektoncd/plumbing.
Did you consider disabling the affinity assistant? I'm currently experiencing this issue and would love some contributor input on this.
Hello all, this is indeed a challenge. How does anybody use Cluster Autoscaler with Tekton successfully? Is everybody just statically provisioning nodes and burning money this way? I would love to see some how-to on setting up Cluster Autoscaler with Tekton (with some kind of volumeClaim)...
@icereed My company uses Cluster Autoscaler, with an NFS server (in the k8s cluster) to serve NFS mounts for PVCs. We also disable the affinity assistant.
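For anyone wanting to try this approach, the affinity assistant can be disabled through Tekton's feature-flags ConfigMap. A minimal sketch, assuming the default install namespace (adjust if your install is customized); note this only helps when the workspace volume supports multi-node access (e.g. an NFS-backed ReadWriteMany PVC):

```yaml
# Sketch: disable the affinity assistant so a PipelineRun's pods are no
# longer forced onto a single node. Safe only if the shared workspace
# PVC can be mounted from multiple nodes (RWX, e.g. NFS).
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines   # default Tekton install namespace
data:
  disable-affinity-assistant: "true"
```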
I have the same issue. Autoscaling works for some jobs that share the same node selector label, but jobs with that label fail to run when resources are insufficient.
Update: when I disabled the affinity assistant, a new node could be scaled up, but the Pod would not run on the new node. I suspect it is still the volume problem.
@grid-dev I'm not sure if this addresses your use case, but we've recently introduced some new options for the affinity assistant and would appreciate your feedback! Please feel free to weigh in on https://github.com/tektoncd/pipeline/issues/6990. Since you're using a cluster autoscaler w/ a limited number of pods per node I wonder if the "isolate-pipelineruns" option would work well for you? https://github.com/tektoncd/pipeline/blob/main/docs/affinityassistants.md#affinity-assistants
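For reference, the coschedule mode mentioned above is also configured in the feature-flags ConfigMap. A hedged sketch (flag value taken from the linked affinity assistant docs; namespace assumed to be the default):

```yaml
# Sketch: isolate each PipelineRun (and its affinity assistant) on its
# own node, so Cluster Autoscaler can provision a fresh node per run
# instead of failing to fit pods onto already-packed nodes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  coschedule: "isolate-pipelinerun"
```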
Not enough "slots" for pods when the affinity assistant allocates them together with Cluster Autoscaler
Expected Behavior
1. An EKS cluster exists with the following setup.
2. Cluster nodes are packed and have between 12 and 17 pods, where 17 is the maximum for this instance type.
3. A PipelineRun is started, consisting of 2 tasks which both share a workspace, i.e. a volumeClaim (see "Pipeline YAML code").
4. The affinity-assistant-... pod allocates the needed pods, including itself, on a single node (or at least in the same zone) so the volumeClaim can be shared. If there is not enough space left for the needed pods, the Cluster Autoscaler provisions a new node.
5. All tasks start and can bind to the volumeClaim, one after the other.
6. The pipeline finishes successfully.
7. If the Cluster Autoscaler created a new node, this node is terminated again after the run was successful.
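The original "Pipeline YAML code" is not included above; the following is a minimal hypothetical PipelineRun illustrating the shape described (two tasks sharing one workspace backed by a volumeClaimTemplate; all names and sizes are illustrative, not from the issue):

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: go-lang-
spec:
  pipelineSpec:
    workspaces:
      - name: shared                # workspace both tasks bind to
    tasks:
      - name: git                   # first task, e.g. a clone step
        taskRef:
          name: git-clone
        workspaces:
          - name: output
            workspace: shared
      - name: build                 # second task, reuses the same volume
        runAfter: ["git"]
        taskRef:
          name: build-task          # hypothetical task name
        workspaces:
          - name: source
            workspace: shared
  workspaces:
    - name: shared
      volumeClaimTemplate:          # Tekton creates a fresh PVC per run
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
```

With a ReadWriteOnce claim like this, both task pods must land on the node where the volume is attached, which is what the affinity assistant enforces.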
Actual Behavior
Steps to Reproduce the Problem
Steps 1 - 4 are the same as in "Expected Behavior".
The affinity-assistant-3a0bc57d00-0 pod is started and the persistentVolumeClaim is bound, but the pod for the first task, go-lang-8txd7-git-pod, is stuck (see "Pod stuck event log").
Additional Info