tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0
8.45k stars 1.77k forks source link

[REGRESSION] Unable to cancel pipeline run - error PipelineRunCouldntCancel reported on a pending custom task #5282

Closed rafalbigaj closed 2 years ago

rafalbigaj commented 2 years ago

Expected Behavior

Pipeline run with unscheduled custom test should be cancellable.

Regression: This scenario works correctly on v0.33.1.

Actual Behavior

When cancelling a pipeline run, which contains unscheduled custom test, the error PipelineRunCouldntCancel occurs.

Steps to Reproduce the Problem

  1. Install wait-task custom task controller.
  2. Run a sample pipeline run with tow steps, one depended on the other:
    apiVersion: tekton.dev/v1beta1
    kind: PipelineRun
    metadata:
    generateName: cancel-test-
    spec:
    pipelineSpec:
    tasks:
    - name: wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
      params:
        - name: duration
          value: 1h
    - name: wait-2
      runAfter:
        - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
      params:
        - name: duration
          value: 10s
    serviceAccountName: pipeline-runner
  3. Cancel the pipeline run setting spec.status to "Cancelled". The final PipelineRun status is:
    apiVersion: tekton.dev/v1beta1
    kind: PipelineRun
    metadata:
    creationTimestamp: "2022-08-08T09:09:13Z"
    generateName: cancel-test-
    generation: 2
    labels:
    tekton.dev/pipeline: cancel-test-qwndq
    name: cancel-test-qwndq
    namespace: wait-task
    resourceVersion: "10224"
    uid: 5cb9a96f-fbfe-4103-961f-81ff775c0a33
    spec:
    pipelineSpec:
    tasks:
    - name: wait-1
      params:
      - name: duration
        value: 1h
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    - name: wait-2
      params:
      - name: duration
        value: 10s
      runAfter:
      - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    serviceAccountName: pipeline-runner
    status: Cancelled
    timeout: 1h0m0s
    status:
    conditions:
    - lastTransitionTime: "2022-08-08T09:10:18Z"
    message: 'PipelineRun "cancel-test-qwndq" was cancelled but had errors trying
      to cancel TaskRuns and/or Runs: Failed to patch Run `cancel-test-qwndq-wait-2`
      with cancellation: runs.tekton.dev "cancel-test-qwndq-wait-2" not found'
    reason: PipelineRunCouldntCancel
    status: Unknown
    type: Succeeded
    pipelineSpec:
    tasks:
    - name: wait-1
      params:
      - name: duration
        value: 1h
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    - name: wait-2
      params:
      - name: duration
        value: 10s
      runAfter:
      - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    runs:
    cancel-test-qwndq-wait-1:
      pipelineTaskName: wait-1
      status:
        extraFields: null
    cancel-test-qwndq-wait-2:
      pipelineTaskName: wait-2
    startTime: "2022-08-08T09:09:13Z"

Additional Info

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:36:49Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Client version: 0.25.0
Pipeline version: v0.38.2
afrittoli commented 2 years ago

Thank you for the bug report @rafalbigaj - it sounds like we don't have test coverage for this. Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios. /cc @abayer @lbernick

afrittoli commented 2 years ago

@rafalbigaj Does the cancel happen right away? Both runs are in the status, which is why the controller attempts to cancel both and fails. It could be that that is the source of the issue. Are you using minimal embedded status in your setup? Any other alpha flag enabled?

abayer commented 2 years ago

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

👍 Agreed!

dibyom commented 2 years ago

cc https://github.com/tektoncd/pipeline/issues/5120

pritidesai commented 2 years ago

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

Would this require changing pluming to deploy wait task controller or can it be done via e2e-common (similar to pipeline controller)? Should we bring wait task controller into pipeline repo to decouple from whats available in experimental repo?

afrittoli commented 2 years ago

I wrote a reproducer unit test, so we can test this even without e2e tests, nonetheless it would be good to install the wait controller.

/*
Copyright 2022 The Tekton Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package pipelinerun

import (
    "fmt"
    "testing"

    "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
    "github.com/tektoncd/pipeline/test"
    "github.com/tektoncd/pipeline/test/parse"
    corev1 "k8s.io/api/core/v1"
    logtesting "knative.dev/pkg/logging/testing"

    _ "knative.dev/pkg/system/testing" // Setup system.Namespace()
)

func TestReconcileTwoCustomTasks(t *testing.T) {
    pipelineRunName := "cancel-test-run"
    prs := []*v1beta1.PipelineRun{parse.MustParsePipelineRun(t, `metadata:
  name: cancel-test-run
  namespace: foo
spec:
  pipelineSpec:
    tasks:
      - name: wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 1h
      - name: wait-2
        runAfter:
          - wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 10s
`)}

    cms := []*corev1.ConfigMap{withCustomTasks(newFeatureFlagsConfigMap())}

    d := test.Data{
        PipelineRuns: prs,
        ConfigMaps:   cms,
    }
    prt := newPipelineRunTest(d, t)
    defer prt.Cancel()

    pr, clients := prt.reconcileRun("foo", pipelineRunName, []string{}, false)
    fmt.Printf("%v", pr)
    err := cancelPipelineRun(prt.TestAssets.Ctx, logtesting.TestLogger(t), pr, clients.Pipeline)
    if err != nil {
        t.Fatalf("Error found: %v", err)
    }
}

I used git bisect to find the first failing commit 9d60d0adf41e74f78029a0eaab73140fa4f7206a

$ git bisect start v0.37.0 v0.36.0 --
$ git bisect run go test -v github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun -run TestReconcileTwoCustomTasks
(...)
9d60d0adf41e74f78029a0eaab73140fa4f7206a is the first bad commit

On the surface it looks unrelated, I need to dig in to understand what happened.

afrittoli commented 2 years ago

The issue happens only with full status - with minimal status it works fine - i.e. in the test, using:

    cms := []*corev1.ConfigMap{withCustomTasks(withEmbeddedStatus(newFeatureFlagsConfigMap(), config.MinimalEmbeddedStatus))}

Works on main.