tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Pod creation issue on Task retry with `workspaces.<name>.volume` settings #7886

Closed: l-qing closed this issue 5 months ago

l-qing commented 6 months ago

# Expected Behavior

The TaskRun should be able to retry normally.

# Actual Behavior

When retrying, the controller is unable to create the Pod for the retry attempt:

    failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1" is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2". Maybe missing or invalid Task default/test-sidecar-workspace

# Steps to Reproduce the Problem

  1. Create the following Task and TaskRun.

    cat <<'EOF' | kubectl replace --force -f -
    apiVersion: tekton.dev/v1
    kind: Task
    metadata:
      name: test-sidecar-workspace
    spec:
      workspaces:
      - description: cache
        name: cache
      steps:
      - name: command
        image: alpine
        volumeMounts:
        - mountPath: /cache
          name: $(workspaces.cache.volume)   # <-- It's this configuration that caused the above problem.
        script: |
          #!/bin/sh
          set -ex
          echo "hello world"
          exit 1
    ---
    apiVersion: tekton.dev/v1
    kind: TaskRun
    metadata:
      name: test-sidecar-workspace-run
    spec:
      taskRef:
        name: test-sidecar-workspace
      retries: 1
      workspaces:
      - name: cache
        emptyDir: {}
    EOF
  2. Wait for the TaskRun to be processed.

    
    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      generation: 1
      labels:
        app.kubernetes.io/managed-by: tekton-pipelines
        tekton.dev/task: test-sidecar-workspace
      name: test-sidecar-workspace-run
    spec:
      retries: 1
      serviceAccountName: default
      taskRef:
        kind: Task
        name: test-sidecar-workspace
      timeout: 1h0m0s
      workspaces:
      - emptyDir: {}
        name: cache
    status:
      completionTime: "2024-04-16T09:58:14Z"
      conditions:
      - lastTransitionTime: "2024-04-16T09:58:14Z"
        message: 'failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1"
          is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2".
          Maybe missing or invalid Task default/test-sidecar-workspace'
        reason: PodCreationFailed
        status: "False"
        type: Succeeded
      podName: ""
      provenance:
        featureFlags:
          AwaitSidecarReadiness: true
          Coschedule: workspaces
          DisableAffinityAssistant: false
          DisableCredsInit: false
          EnableAPIFields: beta
          EnableArtifacts: false
          EnableCELInWhenExpression: false
          EnableKeepPodOnCancel: false
          EnableParamEnum: false
          EnableProvenanceInStatus: true
          EnableStepActions: false
          EnableTektonOCIBundles: false
          EnforceNonfalsifiability: none
          MaxResultSize: 4096
          RequireGitSSHSecretKnownHosts: false
          ResultExtractionMethod: termination-message
          RunningInEnvWithInjectedSidecars: true
          ScopeWhenExpressionsToTask: false
          SendCloudEventsForRuns: false
          SetSecurityContext: false
          VerificationNoMatchPolicy: ignore
      retriesStatus:
      - completionTime: "2024-04-16T09:58:14Z"
        conditions:
        - lastTransitionTime: "2024-04-16T09:58:14Z"
          message: '"step-command" exited with code 1'
          reason: Failed
          status: "False"
          type: Succeeded
        podName: test-sidecar-workspace-run-pod
        provenance:
          featureFlags:
            AwaitSidecarReadiness: true
            Coschedule: workspaces
            DisableAffinityAssistant: false
            DisableCredsInit: false
            EnableAPIFields: beta
            EnableArtifacts: false
            EnableCELInWhenExpression: false
            EnableKeepPodOnCancel: false
            EnableParamEnum: false
            EnableProvenanceInStatus: true
            EnableStepActions: false
            EnableTektonOCIBundles: false
            EnforceNonfalsifiability: none
            MaxResultSize: 4096
            RequireGitSSHSecretKnownHosts: false
            ResultExtractionMethod: termination-message
            RunningInEnvWithInjectedSidecars: true
            ScopeWhenExpressionsToTask: false
            SendCloudEventsForRuns: false
            SetSecurityContext: false
            VerificationNoMatchPolicy: ignore
        startTime: "2024-04-16T09:58:06Z"
        steps:
        - container: step-command
          imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
          name: command
          terminated:
            containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
            exitCode: 1
            finishedAt: "2024-04-16T09:58:13Z"
            reason: Error
            startedAt: "2024-04-16T09:58:13Z"
        taskSpec:
          steps:
          - image: alpine
            name: command
            resources: {}
            script: |
              #!/bin/sh
              set -ex
              echo "hello world"
              exit 1
            volumeMounts:
            - mountPath: /cache
              name: ws-jchh2
          workspaces:
          - description: cache
            name: cache
      startTime: "2024-04-16T09:58:14Z"
      steps:
      - container: step-command
        imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
        name: command
        terminated:
          containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
          exitCode: 1
          finishedAt: "2024-04-16T09:58:13Z"
          reason: Error
          startedAt: "2024-04-16T09:58:13Z"
      taskSpec:
        steps:
        - image: alpine
          name: command
          resources: {}
          script: |
            #!/bin/sh
            set -ex
            echo "hello world"
            exit 1
          volumeMounts:
          - mountPath: /cache
            name: ws-jchh2
        workspaces:
        - description: cache
          name: cache

# Additional Info

- Kubernetes version:

  **Output of `kubectl version`:**

    Client Version: v1.29.3
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.28.8


- Tekton Pipeline version:

  **Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`**

    Client version: 0.36.0
    Pipeline version: v0.58.0



# Analysis

### 0. Summary

1. During the first reconciliation, the variable `workspaces.<name>.volume` in the taskSpec is replaced with a randomly generated volume name.
2. The substituted taskSpec is then stored in `taskRun.status.taskSpec`.
3. On retry, the already-substituted taskSpec from the status is reused directly.
4. However, `CreateVolumes` generates a fresh volume name for the workspace.
5. The newly generated volume name no longer matches the one recorded during the first attempt, which causes the error above (see the sketch after this list).
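
The effect can be reproduced outside the controller with a few lines of Go. This is a minimal sketch of the failure mode only; `newVolumeName` is a hypothetical stand-in for the real name generator, not the upstream API.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // newVolumeName mimics the "ws-" plus random suffix naming used for
    // workspace volumes; it is a simplified stand-in, not the upstream code.
    func newVolumeName() string {
        const chars = "abcdefghijklmnopqrstuvwxyz0123456789"
        suffix := make([]byte, 5)
        for i := range suffix {
            suffix[i] = chars[rand.Intn(len(chars))]
        }
        return "ws-" + string(suffix)
    }

    func main() {
        // First reconcile: the generated volume and the substituted taskSpec agree.
        firstVolume := newVolumeName()
        storedMountName := firstVolume // persisted in taskRun.status.taskSpec

        // Retry reconcile: a new volume is generated, but the mount name is taken
        // from the stored (already substituted) taskSpec.
        retryVolume := newVolumeName()

        fmt.Printf("pod volume: %s, volumeMount: %s\n", retryVolume, storedMountName)
        // The two names differ, so the API server rejects the retry pod with
        // "spec.containers[0].volumeMounts[0].name: Not found".
    }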

### 1. reconcile -> CreateVolumes && applyParamsContextsResultsAndWorkspaces && createPod
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/taskrun.go#L617-L635

### 2.1 CreateVolumes -> RestrictLengthWithRandomSuffix 
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/workspace/apply.go#L45-L82

### 2.2 RestrictLengthWithRandomSuffix -> random string
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/names/generate.go#L54-L61
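
As a rough illustration of what the generator does (the constants and implementation below are assumptions made for this sketch, not the upstream code): the base name is truncated if necessary and a random suffix is appended, so every invocation yields a different result.

    package main

    import (
        "fmt"
        "math/rand"
    )

    const (
        maxNameLength = 63 // Kubernetes object-name limit (assumption for this sketch)
        suffixLength  = 5
    )

    // restrictLengthWithRandomSuffix truncates the base so the final name fits
    // within maxNameLength, then appends "-" plus a random suffix. Calling it
    // twice with the same base produces two different names.
    func restrictLengthWithRandomSuffix(base string) string {
        if len(base) > maxNameLength-suffixLength-1 {
            base = base[:maxNameLength-suffixLength-1]
        }
        const chars = "bcdfghjklmnpqrstvwxz2456789"
        suffix := make([]byte, suffixLength)
        for i := range suffix {
            suffix[i] = chars[rand.Intn(len(chars))]
        }
        return base + "-" + string(suffix)
    }

    func main() {
        fmt.Println(restrictLengthWithRandomSuffix("ws")) // e.g. ws-jchh2
        fmt.Println(restrictLengthWithRandomSuffix("ws")) // a different suffix on every call
    }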

### 3.1 applyParamsContextsResultsAndWorkspaces -> ApplyWorkspaces

https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/taskrun.go#L905-L946

### 3.2 ApplyWorkspaces -> ApplyReplacements

https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/resources/apply.go#L277-L310

### 3.3 `workspaces.%s.volume`
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/resources/apply.go#L299-L301
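
A simplified picture of the substitution step: the real code walks the whole TaskSpec with a replacement map, while this sketch applies the same idea to a single string for illustration.

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Volume names produced by CreateVolumes for this reconcile,
        // keyed by workspace name.
        volumes := map[string]string{"cache": "ws-jchh2"}

        // Replacement map: "workspaces.<name>.volume" -> generated volume name.
        replacements := map[string]string{}
        for wsName, volName := range volumes {
            replacements[fmt.Sprintf("workspaces.%s.volume", wsName)] = volName
        }

        // Apply the replacements to a field that references the variable.
        mountName := "$(workspaces.cache.volume)"
        for variable, value := range replacements {
            mountName = strings.ReplaceAll(mountName, "$("+variable+")", value)
        }

        // The substituted value ("ws-jchh2") is what ends up in status.taskSpec.
        fmt.Println(mountName)
    }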

### 4. Set replaced taskSpec to TaskRun Status
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/taskrun.go#L625

### 5. reconcile -> prepare -> GetTaskFuncFromTaskRun
https://github.com/tektoncd/pipeline/blob/f0a1d64aa88929f8915f208db65a8f731b5c92e2/pkg/reconciler/taskrun/resources/taskref.go#L55-L79

**If the spec is already in the status, it is not fetched again; the stored spec is used as the source of truth.**
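
In other words, the lookup behaves roughly like the following sketch (stand-in types and names, not the upstream `GetTaskFuncFromTaskRun` signature):

    package main

    import "fmt"

    // TaskSpec and TaskRun are simplified stand-ins for the real API types.
    type TaskSpec struct {
        VolumeMountName string // stands in for the substituted step definition
    }

    type TaskRun struct {
        Status struct {
            TaskSpec *TaskSpec
        }
    }

    // taskSpecFor returns the spec to reconcile with: if the status already
    // carries a taskSpec (e.g. from the first attempt), it is reused as-is;
    // otherwise the Task reference is resolved.
    func taskSpecFor(tr *TaskRun, resolve func() *TaskSpec) *TaskSpec {
        if tr.Status.TaskSpec != nil {
            return tr.Status.TaskSpec // source of truth, including substituted values
        }
        return resolve()
    }

    func main() {
        tr := &TaskRun{}
        tr.Status.TaskSpec = &TaskSpec{VolumeMountName: "ws-jchh2"}

        spec := taskSpecFor(tr, func() *TaskSpec { return &TaskSpec{} })
        fmt.Println(spec.VolumeMountName) // "ws-jchh2" survives into the retry
    }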

# Solution

### 1. Store the original taskSpec in taskRun.status.taskSpec, rather than the values after variable substitution

Advantages: It is general; retries would not be affected even if similar randomly generated variables are introduced in the future.

Disadvantages: The information in `status.taskSpec` is no longer intuitive; it would no longer be possible to see the actual variable values used during execution.

### 2. Fetch the taskSpec definition from the remote source on every reconciliation, rather than reusing the one in the status

Disadvantages: The definition would have to be fetched on every reconciliation, which is impractical.

### 3. Change the method of generating volume names
Instead of being completely random, hash the original workspace name, so that the generated name is the same on every reconciliation and does not collide with other names within the current workspaces.

In the current taskSpec, only the value of `workspaces.<name>.volume` can differ between reconciliations; the values of all other variables are fixed and do not change.

**I personally prefer to fix it using this solution.**
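
A minimal sketch of what this could look like, assuming a simple content hash of the workspace name (this is my reading of the proposal, not a merged fix):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // deterministicVolumeName derives a stable "ws-<hash>" name from the
    // workspace name: the same workspace always maps to the same volume name,
    // while different workspaces get different names.
    func deterministicVolumeName(workspaceName string) string {
        sum := sha256.Sum256([]byte(workspaceName))
        return fmt.Sprintf("ws-%x", sum[:4]) // 8 hex chars keeps the name short
    }

    func main() {
        fmt.Println(deterministicVolumeName("cache"))  // identical on every reconcile...
        fmt.Println(deterministicVolumeName("cache"))  // ...so retried pods reference an existing volume
        fmt.Println(deterministicVolumeName("source")) // a different workspace gets a different name
    }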

### 4. Remove support for `workspaces.<name>.volume`
l-qing commented 6 months ago

/assign

tricktron commented 4 months ago

@l-qing I am facing this issue as well. Is there any manual workaround? Will this be backported to older stable versions?

l-qing commented 4 months ago

> I am facing this issue as well. Is there any manual workaround? Will this be backported to older stable versions?

I didn't find a manual workaround.
I thought it was a low-probability issue, so I didn't consider backporting the fix to older stable versions.
If you think it's necessary, you can ask @vdemeester for help.