tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Tekton creates retry Pod of a TaskRun even though Pod of previous attempt has not terminated #7218

Open dennischristmann opened 1 year ago

dennischristmann commented 1 year ago

We are using Tasks in Pipelines with retries. In rare situations, a retry Pod is created although the Pod of the first attempt has not terminated yet. Eventually this leads to a situation in which the complete PipelineRun runs into a timeout, because the TaskRun never delivers a result.
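
For context, this is roughly how such a task is wired up; a minimal sketch with placeholder names, not our actual Pipeline:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: example-pipeline          # placeholder name
spec:
  workspaces:
    - name: ws
  tasks:
    - name: build
      retries: 2                  # on failure, Tekton creates a new attempt (Pod suffix -retry1)
      taskRef:
        name: build-task          # placeholder Task
      workspaces:
        - name: ws
          workspace: ws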

Expected Behavior

Tekton should not create a retry Pod if a previous attempt of the TaskRun is still running.

Actual Behavior

  1. The Pod of the TaskRun (part of a PipelineRun) is scheduled on a node X but not yet started. It stays PENDING for a couple of minutes because it is waiting for a volume (the workspace, an EBS volume; we are running on AWS) that is still mounted on a different node (we have disabled the node affinity assistant; see the configuration sketch after this list).
  2. The Pod, which is still PENDING, gets an EVICTED event because Karpenter does some consolidation and decides to remove the node on which the Pod was scheduled. Because the Pod is still PENDING, the do-not-evict annotation on the Pod (an annotation that controls Karpenter's behavior) is ignored; as far as I understand, neither Karpenter nor the kube-scheduler has any concerns about "evicting" Pods that are not yet running.
  3. The Pod of the TaskRun starts running because the volume is now available and can be mounted. However, because of the EVICTION event, the Pod does not run to completion. Maybe it is completely irrelevant that this first Pod starts running at all, but I am not sure.
  4. Now strange things start to happen: within a period of 200 ms, the first Pod and the Pod of the retry TaskRun (i.e., the second Pod with the suffix -retry1) are scheduled on two different nodes. Because of the volume/workspace required by both Pods, the retry Pod never leaves the PENDING state and waits for the volume until the entire PipelineRun times out. The first-attempt Pod, however, starts running once the volume is available but is apparently no longer monitored by Tekton.
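
For reference, the two configuration pieces mentioned in steps 1 and 2 look roughly like this; this is a sketch with placeholder names, and I assume Tekton propagates the PipelineRun annotation down to the TaskRun Pods:

# Node affinity assistant disabled via Tekton's feature-flags ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  disable-affinity-assistant: "true"
---
# Karpenter's do-not-evict annotation, set on the PipelineRun; as described in
# step 2, Karpenter ignores it while a Pod is still PENDING
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: example-run               # placeholder name
  annotations:
    karpenter.sh/do-not-evict: "true"
spec:
  pipelineRef:
    name: example-pipeline        # placeholder, see the sketch above
  workspaces:
    - name: ws
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 3Gi
          storageClassName: ebs-no-backup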

I am not completely sure which component re-schedules the first-attempt Pod on a second node, but I guess it is the kube-scheduler. In any case, I don't think that having two Pods for the same TaskRun is good in general. In my opinion, Tekton should not create a retry of the TaskRun until the Pod of the previous attempt is actually dead.

Steps to Reproduce the Problem

Unfortunately, I could not yet reproduce the situation reliably; it only happens infrequently. Maybe a race condition plays a role here, too. Perhaps point 3 from above is crucial, i.e., that an evicted Pod of the TaskRun starts running but does not run to completion. Maybe this leads the Tekton controller to believe that the first-attempt Pod really terminated unsuccessfully, without it being re-scheduled by some other Kubernetes mechanism.

Additional Info

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.4-eks-2d98532", GitCommit:"3d90c097c72493c2f1a9dd641e4a22d24d15be68", GitTreeState:"clean", BuildDate:"2023-07-28T16:51:44Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Client version: 0.32.0
Chains version: v0.17.0
Pipeline version: v0.50.1
Triggers version: v0.25.0
Dashboard version: v0.39.0
Operator version: v0.68.0
pritidesai commented 1 year ago

Potentially add an e2e test for retry with workspaces.

@dennischristmann how many workspaces is your pipeline configured to run with?

dennischristmann commented 1 year ago

Hey @pritidesai ,

we only have one workspace in this pipeline. It is created in the pipeline's TriggerTemplate as follows:

workspaces:
  - name: ws
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
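          # ReadWriteOnce: the EBS volume can only be attached to one node at a
          # time, so a second Pod scheduled on a different node stays PENDING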
        resources:
          requests:
            storage: 3Gi
        storageClassName: ebs-no-backup

But I don't think that the actual issue is really caused by the workspace. The workspace only provokes the problem because the first-attempt Pod remains PENDING for a couple of minutes, which is what allows multiple Pods to run for the same TaskRun. This could probably also happen with another source of delay, such as long node provisioning times or slow image pulls. If the first-attempt Pod started faster, it would not get evicted, and neither the Tekton controller nor the kube-scheduler would end up scheduling a second Pod.