tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Cluster Autoscaler conflict with volumeClaim and/or affinity-assistant #4699

Open grid-dev opened 2 years ago

grid-dev commented 2 years ago

Not enough "slots" for pods when the affinity assistant schedules pods alongside the Cluster Autoscaler.

Expected Behavior

  1. An EKS cluster exists with the following setup:

    • Cluster Autoscaler and
    • 6 x t3.medium nodes in a node_group.
    • The nodes are spread across 3 availability zones in one AWS region.
    • Every node has labels assigned (see "K8s node labels")
  2. Cluster nodes are packed, running between 12 and 17 pods each (17 is the maximum for this instance type)

  3. A PipelineRun is started consisting of 2 tasks that both share a workspace, i.e. a volumeClaim (see "Pipeline YAML code")

  4. affinity-assistant-... places the needed pods, including itself, on a single node (or at least within the same region) so the volumeClaim can be shared.

  5. If there is not enough space left for the needed pods, the Cluster Autoscaler provisions a new node

  6. All tasks start and can bind to the volumeClaim, one after the other.

  7. The pipeline finishes successfully

  8. If the Cluster Autoscaler created a new node, this node is terminated again after the run has finished successfully.

Actual Behavior

Steps to Reproduce the Problem

Steps 1 - 4 are the same as in "Expected Behavior"

  1. The affinity-assistant-3a0bc57d00-0 pod is started and the persistentVolumeClaim is bound, but the pod for the first task, go-lang-8txd7-git-pod, is stuck in Pending (see "Pod stuck event log")
  2. The pod times out in this deadlock → the run fails

Additional Info

Pod stuck event log:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   37s (x3 over 42s)  default-scheduler   0/6 nodes are available: 2 Too many pods, 4 node(s) didn't find available persistent volumes to bind.
  Normal   NotTriggerScaleUp  37s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't find available persistent volumes to bind
  Warning  FailedScheduling   24s (x2 over 32s)  default-scheduler   0/6 nodes are available: 1 node(s) didn't match pod affinity rules, 1 node(s) didn't match pod affinity/anti-affinity rules, 2 Too many pods, 3 node(s) had volume node affinity conflict.
Kubernetes versions:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:17:57Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
K8s node labels:

beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
capacity_type=ON_DEMAND
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=managed-group-ondemand20220302164423693200000001
eks.amazonaws.com/nodegroup-image=ami-04d4d5e816895f43e
eks.amazonaws.com/sourceLaunchTemplateId=lt-0322b2edf9c5fb9f3
eks.amazonaws.com/sourceLaunchTemplateVersion=1
environment=dev
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-.....eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=t3.medium
org=dev
tenant=fooBar
topology.ebs.csi.aws.com/zone=eu-central-1a
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
Pipeline YAML code:

---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git
spec:
  params:
    - name: url
      description: Git repository URL to clone from
      type: string
    - name: branch
      description: Branch to use for cloning
      type: string
      default: main
  workspaces:
    - name: output
      description: The workspace containing the source code
  results:
    - name: GIT_COMMIT_HASH
      description: SHA hash for current commit from git
    - name: DATE_STRING
      description: Current date as hash
  steps:
    - name: clone
      image: alpine/git:v2.32.0@sha256:192d7b402bfd313757d5316920fea98606761e96202d850e5bdeab407b9a72ae
      workingDir: $(workspaces.output.path)
      script: |
        #!/usr/bin/env sh

        # -e  Exit on error
        # -u  Treat unset param as error
        set -eu

        # Fetch hash for latest commit (without cloning)
        # GIT_COMMIT_HASH="$(git ls-remote "$(params.url)" "$(params.branch)" | awk '{ print $1}')"

        # .. alternative
        git clone \
          --depth "1" \
          --single-branch "$(params.url)" \
          --branch "$(params.branch)" \
          tmp_repo

        cd tmp_repo

        GIT_COMMIT_HASH="$(git rev-parse HEAD)"
        DATE_STRING="$(date +"%Y-%m-%d_%H-%M-%S_%Z")"

        # Write data to result
        echo "GIT_$GIT_COMMIT_HASH" | tr -cd '[:alnum:]._-' > $(results.GIT_COMMIT_HASH.path)
        echo "DATE_$DATE_STRING" | tr -cd '[:alnum:]._-' > $(results.DATE_STRING.path)
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko
spec:
  params:
    - name: docker_registry
      description: Docker repository URL to write image into
      type: string
    - name: GIT_COMMIT_HASH
      description: SHA hash for current commit from git
      type: string
    - name: DATE_STRING
      description: Current date as hash
      type: string
  volumes:
    - name: aws-creds
      secret:
        secretName: aws-credentials
    - name: docker-configmap
      configMap:
        name: docker-config
  workspaces:
    - name: output
      description: The workspace containing the source code
  steps:
    - name: echo
      image: alpine/git:v2.32.0@sha256:192d7b402bfd313757d5316920fea98606761e96202d850e5bdeab407b9a72ae
      workingDir: $(workspaces.output.path)/tmp_repo
      script: |
        #!/usr/bin/env sh

        # -e  Exit on error
        # -u  Treat unset param as error
        set -eu

        echo "List of image tags:"
        echo "|$(params.GIT_COMMIT_HASH)|"
        echo "|$(params.DATE_STRING)|"
    - name: build-push
      # https://github.com/GoogleContainerTools/kaniko
      image: gcr.io/kaniko-project/executor:v1.8.0@sha256:ff98af876169a488df4d70418f2a60e68f9e304b2e68d5d3db4c59e7fdc3da3c
      workingDir: $(workspaces.output.path)/tmp_repo
      command:
        - /kaniko/executor
      args:
        - --dockerfile=./Dockerfile
        - --context=$(workspaces.output.path)/tmp_repo
        - --destination=$(params.docker_registry):latest
        - --destination=$(params.docker_registry):$(params.GIT_COMMIT_HASH)
        - --destination=$(params.docker_registry):$(params.DATE_STRING)
        - --cache=true
        - --cache-ttl=720h # 1 month
        - --cache-repo=$(params.docker_registry)
      # kaniko assumes it is running as root, which means this example fails on platforms
      # that default to run containers as random uid (like OpenShift). Adding this securityContext
      # makes it explicit that it needs to run as root.
      securityContext:
        runAsUser: 0
      env:
        - name: "DOCKER_CONFIG"
          value: "/kaniko/.docker/"
      volumeMounts:
        - name: aws-creds
          mountPath: /root/.aws
        - name: docker-configmap
          mountPath: /kaniko/.docker/
---
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: go-lang
spec:
  params:
    - name: repo_url
      description: Repository URL to clone from.
      type: string
    - name: docker_registry
      description: Docker repository URL to write image into
      type: string
      default: 0190....dkr.ecr.eu-central-1.amazonaws.com/fooBar
  workspaces:
    - name: output
      description: The workspace containing the source code
  tasks:
    - name: git
      taskRef:
        name: git
      params:
        - name: url
          value: "$(params.repo_url)"
        - name: branch
          value: main
      workspaces:
        - name: output
          workspace: output
    - name: kaniko
      taskRef:
        name: kaniko
      runAfter:
        - git
      params:
        - name: docker_registry
          value: "$(params.docker_registry)"
        - name: GIT_COMMIT_HASH
          value: "$(tasks.git.results.GIT_COMMIT_HASH)"
        - name: DATE_STRING
          value: "$(tasks.git.results.DATE_STRING)"
      workspaces:
        - name: output
          workspace: output
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tekton-user
secrets:
  - name: bitbucket-ssh-key
  - name: aws-credentials
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: go-lang-
spec:
  timeout: 2m
  serviceAccountName: tekton-user
  pipelineRef:
    name: go-lang
  params:
    - name: repo_url
      value: git@bitbucket.org:fooBar/docker-test.git
  workspaces:
    - name: output
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
tekton-robot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

grid-dev commented 2 years ago

/remove-lifecycle stale

As the issue still persists.

vdemeester commented 2 years ago

/remove-lifecycle stale

tekton-robot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

alex-souslik-hs commented 1 year ago

Did you consider disabling the affinity assistant? I'm currently experiencing this issue and would love some contributor input on this.

icereed commented 1 year ago

Hello all, this is indeed a challenge. How does anybody use the cluster autoscaler with Tekton successfully? Is everybody just statically provisioning nodes and burning money that way? I would love to see a how-to on setting up the Cluster Autoscaler with Tekton (with some kind of volumeClaim)...

jleonar commented 1 year ago

@icereed My company uses the cluster autoscaler with an NFS server (in the k8s cluster) to serve NFS mounts for PVCs. We also disable the affinity-assistant.
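
For anyone looking for the concrete settings behind such a setup, here is a minimal sketch (not taken from this thread): the affinity assistant is switched off via the feature-flags ConfigMap in the tekton-pipelines namespace, and the workspace is bound to an NFS-backed StorageClass. The class name nfs-client is an assumption (e.g. as created by nfs-subdir-external-provisioner), and only the relevant ConfigMap key is shown:

apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  # Only the relevant key is shown; merge with your existing flags.
  # Schedules TaskRun pods independently instead of pinning them next to an assistant pod.
  disable-affinity-assistant: "true"
---
# Same PipelineRun as in the issue, but the workspace uses an NFS-backed StorageClass.
# ReadWriteMany lets pods on different nodes mount the same workspace volume.
# "nfs-client" is an assumed class name.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: go-lang-
spec:
  timeout: 2m
  serviceAccountName: tekton-user
  pipelineRef:
    name: go-lang
  params:
    - name: repo_url
      value: git@bitbucket.org:fooBar/docker-test.git
  workspaces:
    - name: output
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client   # assumed NFS-backed class
          accessModes:
            - ReadWriteMany              # shareable across nodes
          resources:
            requests:
              storage: 1Gi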

alanmoment commented 1 year ago

I have the same issue. Autoscaling works for some jobs that share the same node selector label, but jobs with that label fail to run when resources are insufficient.

Update: when I disabled the affinity-assistant, a node is scaled up, but the Pod does not run on the new node. I guess it is still the volume problem.
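
One thing that can cause exactly this symptom is the PersistentVolume being bound (and pinned to an availability zone) before the new node exists. A StorageClass with volumeBindingMode: WaitForFirstConsumer delays binding until a pod is actually scheduled, which also lets the autoscaler take volume topology into account. A minimal sketch for the EBS CSI driver; the class name ebs-wait is an assumption and would be referenced via storageClassName in the workspace volumeClaimTemplate:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait                           # assumed name; reference it from the volumeClaimTemplate
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer    # bind the PV only once the pod has a node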

lbernick commented 1 year ago

@grid-dev I'm not sure if this addresses your use case, but we've recently introduced some new options for the affinity assistant and would appreciate your feedback! Please feel free to weigh in on https://github.com/tektoncd/pipeline/issues/6990. Since you're using a cluster autoscaler w/ a limited number of pods per node I wonder if the "isolate-pipelineruns" option would work well for you? https://github.com/tektoncd/pipeline/blob/main/docs/affinityassistants.md#affinity-assistants
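
For reference, per the linked affinityassistants.md, the affinity assistant mode is selected through the coschedule key in the feature-flags ConfigMap. A minimal sketch; the exact value names depend on the Tekton Pipeline version, see the doc above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  # Schedule all pods (and PVCs) of a PipelineRun onto one node, isolated from other runs.
  coschedule: "isolate-pipelinerun"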