dennischristmann opened this issue 1 year ago
Potentially add an e2e test for retry with workspaces.
@dennischristmann how many workspaces is your pipeline configured to run with?
Hey @pritidesai,
we only have one workspace in this pipeline. It is created in the pipeline's TriggerTemplate as follows:
workspaces:
  - name: ws
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 3Gi
        storageClassName: ebs-no-backup
But I don't think the workspace is the actual root cause. It merely triggers the situation in which multiple Pods run for the same TaskRun, because the first-attempt Pod remains in PENDING for a couple of minutes. The same could probably happen with any other source of delay, such as long volume provisioning times or slow image pulls. If the first-attempt Pod started faster, it would not get evicted, and the Tekton controller and kube-scheduler would not re-schedule it twice.
We are using Tasks in Pipelines with retries. In rare situations, a retry Pod is created even though the Pod of the first attempt has not yet terminated. In the end, this leads to a situation in which the whole PipelineRun runs into a timeout because the TaskRun never reports a result.
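For illustration, a minimal sketch of how such a pipeline task is wired up (the task, taskRef, and workspace names are made up for this example, not taken from our actual pipeline):

tasks:
  - name: build                # hypothetical task name
    retries: 2                 # the controller re-runs the task up to 2 more times on failure
    taskRef:
      name: build-task         # hypothetical Task reference
    workspaces:
      - name: source           # workspace name declared by the Task
        workspace: ws          # the pipeline workspace backed by the volumeClaimTemplate above

With retries configured, the Tekton controller creates a new attempt (and thus a new Pod) after a failed attempt; the problem described here is that this retry Pod is sometimes created while the first-attempt Pod still exists.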
Expected Behavior
Tekton should not create a retry Pod if a previous attempt of the TaskRun is still running.
Actual Behavior
I am not completely sure which component re-schedules the first-attempt Pod on a second node, but I guess it is kube-scheduler. However, I don't think that having two Pods for the same TaskRun is good in general. In my opinion, Tekton should not create a retry of the TaskRun as long as the Pod of the previous attempt isn't really dead.
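One way to confirm the duplicate Pods (assuming Tekton's standard tekton.dev/taskRun Pod label) would be kubectl get pods -l tekton.dev/taskRun=<taskrun-name> -o wide, which should list both the stuck first-attempt Pod and the retry Pod, in this situation on different nodes.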
Steps to Reproduce the Problem
Unfortunately, I have not yet been able to reproduce the situation reliably; it only happens infrequently. Maybe some race condition plays a role here, too. Perhaps point 3 from above is crucial, i.e., that an evicted Pod of the TaskRun starts running but does not run to completion. Maybe this leads the Tekton controller to believe that the first-attempt Pod really terminated unsuccessfully and will not be re-scheduled by some other Kubernetes mechanism.
Additional Info