tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

TestPipelineRunTimeout is flaky #3460

Closed mattmoor closed 2 years ago

mattmoor commented 4 years ago

Expected Behavior

TestPipelineRunTimeout consistently passes, observing the expected failure mode (a "PipelineRunTimeout" reason).

Actual Behavior

TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).

Steps to Reproduce the Problem

Run the test a lot.

I will post my analysis here shortly, as I believe I know WHY this is happening.

mattmoor commented 4 years ago

Every so often this test will fail with:

timeout_test.go:102: Error waiting for PipelineRun pipeline-run-timeout-qaakhkfe to finish: "pipeline-run-timeout-qaakhkfe" completed with the wrong reason: Failed

The setup is a task that sleeps for 10s and a pipeline that has a 5s timeout: https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L49-L61

This check expects the "Reason" to be "PipelineRunTimeout": https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L100-L103
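The expectation can be sketched roughly like this (a simplified stand-in for the linked check, not the real test code, which goes through the project's wait helpers and the v1alpha1 API):

```go
package main

import "fmt"

// expectTimeoutReason mirrors the assertion in timeout_test.go in simplified
// form: once the PipelineRun finishes, its Succeeded=False condition must
// carry the reason "PipelineRunTimeout"; any other reason is the flake.
func expectTimeoutReason(name, reason string) error {
	if reason == "PipelineRunTimeout" {
		return nil
	}
	return fmt.Errorf("%q completed with the wrong reason: %s", name, reason)
}
```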

However, when the test intermittently fails the status shows the TaskRun timed out, but the PipelineRun Failed:

        spec:
          pipelineRef:
            name: pipeline-run-timeout-kjlunngp
          serviceAccountName: default
          timeout: 5s     <-- Expected to timeout (the task sleeps for 10s)
        status:
          completionTime: "2020-10-24T12:37:13Z"
          conditions:
          - lastTransitionTime: "2020-10-24T12:37:13Z"
            message: 'Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0'
            reason: Failed    <-- We get Failed instead of PipelineRunTimeout
            status: "False"
            type: Succeeded
          pipelineSpec:
            tasks:
            - name: foo
              taskRef:
                kind: Task
                name: pipeline-run-timeout-lfsmutbs
          startTime: "2020-10-24T12:37:09Z"
          taskRuns:
            pipeline-run-timeout-mncrmugc-foo-74kf8:
              pipelineTaskName: foo
              status:
                completionTime: "2020-10-24T12:37:13Z"
                conditions:
                - lastTransitionTime: "2020-10-24T12:37:13Z"
                  message: TaskRun "pipeline-run-timeout-mncrmugc-foo-74kf8" failed to finish within "5s"
                  reason: TaskRunTimeout   <-- The task run timed out (as expected) though
                  status: "False"
                  type: Succeeded

The irony is that the TaskRun gets its timeout from the PipelineRun here: https://github.com/tektoncd/pipeline/blob/6d8f4513213a0a85f5164e0ed13b8a15c1839f89/pkg/reconciler/pipelinerun/pipelinerun.go#L807-L816

When we transition the PipelineRun to "Failed" we skip this check: https://github.com/tektoncd/pipeline/blob/3185d055fa0959830d3c986407a1e6e58f074003/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L206-L218

... digging into why: the check looks at the StartTime in status, and in the failing cases that StartTime is less than 5s before the completion time(!):

            PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
                StartTime:      s"2020-10-25 20:35:07 +0000 UTC",
        -       CompletionTime: nil,
        +       CompletionTime: s"2020-10-25 20:35:11.874768406 +0000 UTC m=+601.330431507",
                TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{

The same status update shows us propagating the following for the TaskRun:

                            TaskRunStatusFields: v1beta1.TaskRunStatusFields{
                                PodName:        "pipeline-run-timeout-qaakhkfe-foo-zzlbf-pod-577ph",
                                StartTime:      s"2020-10-25 20:35:06 +0000 UTC",
        -                       CompletionTime: nil,
        +                       CompletionTime: s"2020-10-25 20:35:11 +0000 UTC",
mattmoor commented 4 years ago

Generally this means that any PipelineRun specifying a Timeout may simply show up as Failed due to this race.

My $0.02 on the appropriate "fix": the PipelineRun's "StartTime" should never be later than the "StartTime" of any of its constituent TaskRuns. That invariant would guarantee that a copied-down timeout never manifests this way.

cc @vdemeester @afrittoli @ImJasonH @bobcatfish for thoughts 🙏

mattmoor commented 4 years ago

Sorry, I should mention that the first bit of yaml is actually a dump from a prior run that I lifted from my older bug that this replaces. So don't try to rationalize it with the later "diffs".

The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58

mattmoor commented 4 years ago

I think this stems from stale informer caches combined with the retry logic we have for updating status.

On a subsequent reconciliation of the PipelineRun (+1s), starting from a stale, uninitialized copy, we call InitializeConditions again, which resets StartTime, even though a prior reconciliation had already created the TaskRuns (likely the very event that triggered us to be reconciled).

The diff for this status update looks like:

    kubelogs.go:197: D 20:35:07.329 controlplane-2 [pipelinerun/reconciler.go:322] [arendelle-px7ff/pipeline-run-timeout-qaakhkfe] Updating status with:   v1beta1.PipelineRunStatus{
            Status: v1beta1.Status{
                ObservedGeneration: 0,
        -       Conditions:         nil,
        +       Conditions: v1beta1.Conditions{
        +           {
        +               Type:               "Succeeded",
        +               Status:             "Unknown",
        +               LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-25 20:35:07.188165477 +0"...},
        +               Reason:             "Running",
        +               Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
        +           },
        +       },
                Annotations: nil,
            },
            PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        -       StartTime:      nil,
        +       StartTime:      s"2020-10-25 20:35:07.055779791 +0000 UTC m=+596.511442892",
                CompletionTime: nil,
        -       TaskRuns:       nil,
        +       TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
        +           "pipeline-run-timeout-qaakhkfe-foo-zzlbf": &{PipelineTaskName: "foo", Status: &v1beta1.TaskRunStatus{}},
        +       },
                PipelineResults: nil,
        -       PipelineSpec:    nil,
        +       PipelineSpec:    &v1beta1.PipelineSpec{Tasks: []v1beta1.PipelineTask{{Name: "foo", TaskRef: &v1beta1.TaskRef{...}}}},
                SkippedTasks:    nil,
            },
          }

Note the empty StartTime in the base, despite this in a prior update:

            PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        -       StartTime:      nil,
        +       StartTime:      s"2020-10-25 20:35:06.759884985 +0000 UTC m=+596.215548086",
jerop commented 2 years ago

TestPipelineRunTimeout is still flaky

jerop commented 2 years ago

fixed in https://github.com/tektoncd/pipeline/pull/5134 🤞🏾

thanks @abayer 🎉