rafalbigaj commented 2 years ago

Expected Behavior

Pipeline run with unscheduled custom test should be cancellable.

Regression: This scenario works correctly on v0.33.1.

Actual Behavior

When cancelling a pipeline run, which contains unscheduled custom test, the error PipelineRunCouldntCancel occurs.

Steps to Reproduce the Problem

Install wait-task custom task controller.

Run a sample pipeline run with tow steps, one depended on the other:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: cancel-test-
spec:
pipelineSpec:
tasks:
- name: wait-1
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
  params:
    - name: duration
      value: 1h
- name: wait-2
  runAfter:
    - wait-1
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
  params:
    - name: duration
      value: 10s
serviceAccountName: pipeline-runner

Cancel the pipeline run setting spec.status to "Cancelled". The final PipelineRun status is:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
creationTimestamp: "2022-08-08T09:09:13Z"
generateName: cancel-test-
generation: 2
labels:
tekton.dev/pipeline: cancel-test-qwndq
name: cancel-test-qwndq
namespace: wait-task
resourceVersion: "10224"
uid: 5cb9a96f-fbfe-4103-961f-81ff775c0a33
spec:
pipelineSpec:
tasks:
- name: wait-1
  params:
  - name: duration
    value: 1h
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
    metadata: {}
    spec: null
- name: wait-2
  params:
  - name: duration
    value: 10s
  runAfter:
  - wait-1
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
    metadata: {}
    spec: null
serviceAccountName: pipeline-runner
status: Cancelled
timeout: 1h0m0s
status:
conditions:
- lastTransitionTime: "2022-08-08T09:10:18Z"
message: 'PipelineRun "cancel-test-qwndq" was cancelled but had errors trying
  to cancel TaskRuns and/or Runs: Failed to patch Run `cancel-test-qwndq-wait-2`
  with cancellation: runs.tekton.dev "cancel-test-qwndq-wait-2" not found'
reason: PipelineRunCouldntCancel
status: Unknown
type: Succeeded
pipelineSpec:
tasks:
- name: wait-1
  params:
  - name: duration
    value: 1h
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
    metadata: {}
    spec: null
- name: wait-2
  params:
  - name: duration
    value: 10s
  runAfter:
  - wait-1
  taskSpec:
    apiVersion: example.dev/v0
    kind: Wait
    metadata: {}
    spec: null
runs:
cancel-test-qwndq-wait-1:
  pipelineTaskName: wait-1
  status:
    extraFields: null
cancel-test-qwndq-wait-2:
  pipelineTaskName: wait-2
startTime: "2022-08-08T09:09:13Z"

Additional Info

Kubernetes version:

Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:36:49Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}

Tekton Pipeline version:

Client version: 0.25.0
Pipeline version: v0.38.2

afrittoli commented 2 years ago

Thank you for the bug report @rafalbigaj - it sounds like we don't have test coverage for this. Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios. /cc @abayer @lbernick

afrittoli commented 2 years ago

@rafalbigaj Does the cancel happen right away? Both runs are in the status, which is why the controller attempts to cancel both and fails. It could be that that is the source of the issue. Are you using minimal embedded status in your setup? Any other alpha flag enabled?

abayer commented 2 years ago

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

👍 Agreed!

dibyom commented 2 years ago

cc https://github.com/tektoncd/pipeline/issues/5120

pritidesai commented 2 years ago

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

Would this require changing pluming to deploy wait task controller or can it be done via e2e-common (similar to pipeline controller)? Should we bring wait task controller into pipeline repo to decouple from whats available in experimental repo?

afrittoli commented 2 years ago

I wrote a reproducer unit test, so we can test this even without e2e tests, nonetheless it would be good to install the wait controller.

/*
Copyright 2022 The Tekton Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package pipelinerun

import (
    "fmt"
    "testing"

    "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
    "github.com/tektoncd/pipeline/test"
    "github.com/tektoncd/pipeline/test/parse"
    corev1 "k8s.io/api/core/v1"
    logtesting "knative.dev/pkg/logging/testing"

    _ "knative.dev/pkg/system/testing" // Setup system.Namespace()
)

func TestReconcileTwoCustomTasks(t *testing.T) {
    pipelineRunName := "cancel-test-run"
    prs := []*v1beta1.PipelineRun{parse.MustParsePipelineRun(t, `metadata:
  name: cancel-test-run
  namespace: foo
spec:
  pipelineSpec:
    tasks:
      - name: wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 1h
      - name: wait-2
        runAfter:
          - wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 10s
`)}

    cms := []*corev1.ConfigMap{withCustomTasks(newFeatureFlagsConfigMap())}

    d := test.Data{
        PipelineRuns: prs,
        ConfigMaps:   cms,
    }
    prt := newPipelineRunTest(d, t)
    defer prt.Cancel()

    pr, clients := prt.reconcileRun("foo", pipelineRunName, []string{}, false)
    fmt.Printf("%v", pr)
    err := cancelPipelineRun(prt.TestAssets.Ctx, logtesting.TestLogger(t), pr, clients.Pipeline)
    if err != nil {
        t.Fatalf("Error found: %v", err)
    }
}

I used git bisect to find the first failing commit 9d60d0adf41e74f78029a0eaab73140fa4f7206a

$ git bisect start v0.37.0 v0.36.0 --
$ git bisect run go test -v github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun -run TestReconcileTwoCustomTasks
(...)
9d60d0adf41e74f78029a0eaab73140fa4f7206a is the first bad commit

On the surface it looks unrelated, I need to dig in to understand what happened.

afrittoli commented 2 years ago

The issue happens only with full status - with minimal status it works fine - i.e. in the test, using:

    cms := []*corev1.ConfigMap{withCustomTasks(withEmbeddedStatus(newFeatureFlagsConfigMap(), config.MinimalEmbeddedStatus))}

Works on main.

tektoncd / pipeline

[REGRESSION] Unable to cancel pipeline run - error PipelineRunCouldntCancel reported on a pending custom task #5282

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Additional Info