Every so often this test will fail with:
```
timeout_test.go:102: Error waiting for PipelineRun pipeline-run-timeout-qaakhkfe to finish: "pipeline-run-timeout-qaakhkfe" completed with the wrong reason: Failed
```
The setup is a task that sleeps for 10s and a pipeline that has a 5s timeout: https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L49-L61
This check expects the "Reason" to be "PipelineRunTimeout": https://github.com/tektoncd/pipeline/blob/adc4cfa733af86874017d07459cc5ad62bf82bd3/test/v1alpha1/timeout_test.go#L100-L103
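For orientation, here is a heavily condensed sketch of the test's shape. It is illustrative only: the real test lives in test/v1alpha1/timeout_test.go and uses its own helpers, and the v1beta1 types and names below are just stand-ins for readability.

```go
package example

import (
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"knative.dev/pkg/apis"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// TestPipelineRunTimeoutSketch is an illustrative stand-in for the real test:
// a PipelineRun with a 5s timeout around a Pipeline whose single Task sleeps 10s,
// plus an assertion that the terminal reason is "PipelineRunTimeout".
func TestPipelineRunTimeoutSketch(t *testing.T) {
	pr := &v1beta1.PipelineRun{
		ObjectMeta: metav1.ObjectMeta{Name: "pipeline-run-timeout"},
		Spec: v1beta1.PipelineRunSpec{
			PipelineRef: &v1beta1.PipelineRef{Name: "pipeline-run-timeout"}, // one task that sleeps 10s
			Timeout:     &metav1.Duration{Duration: 5 * time.Second},
		},
	}

	// ... create the Task/Pipeline/PipelineRun in the cluster and wait for it to finish ...

	cond := pr.Status.GetCondition(apis.ConditionSucceeded)
	if cond == nil || cond.Reason != "PipelineRunTimeout" {
		t.Errorf("%q completed with the wrong reason: %v", pr.Name, cond)
	}
}
```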
However, when the test intermittently fails, the status shows that the TaskRun timed out but the PipelineRun merely Failed:
```yaml
spec:
  pipelineRef:
    name: pipeline-run-timeout-kjlunngp
  serviceAccountName: default
  timeout: 5s  # <-- Expected to timeout (the task sleeps for 10s)
status:
  completionTime: "2020-10-24T12:37:13Z"
  conditions:
  - lastTransitionTime: "2020-10-24T12:37:13Z"
    message: 'Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0'
    reason: Failed  # <-- We get Failed instead of PipelineRunTimeout
    status: "False"
    type: Succeeded
  pipelineSpec:
    tasks:
    - name: foo
      taskRef:
        kind: Task
        name: pipeline-run-timeout-lfsmutbs
  startTime: "2020-10-24T12:37:09Z"
  taskRuns:
    pipeline-run-timeout-mncrmugc-foo-74kf8:
      pipelineTaskName: foo
      status:
        completionTime: "2020-10-24T12:37:13Z"
        conditions:
        - lastTransitionTime: "2020-10-24T12:37:13Z"
          message: TaskRun "pipeline-run-timeout-mncrmugc-foo-74kf8" failed to finish within "5s"
          reason: TaskRunTimeout  # <-- The task run timed out (as expected) though
          status: "False"
          type: Succeeded
```
The irony is that the TaskRun gets its timeout from the PipelineRun here: https://github.com/tektoncd/pipeline/blob/6d8f4513213a0a85f5164e0ed13b8a15c1839f89/pkg/reconciler/pipelinerun/pipelinerun.go#L807-L816
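The flavor of that copy-down (a simplified, hypothetical rendering of the idea, not the actual code at the link) is that the TaskRun's budget is derived from the PipelineRun's timeout measured against the PipelineRun's recorded StartTime:

```go
package example

import "time"

// taskRunTimeout is a simplified, hypothetical rendering of the idea behind the linked
// lines (not the real implementation): the TaskRun inherits whatever is left of the
// PipelineRun's timeout budget, measured from the PipelineRun's recorded StartTime.
func taskRunTimeout(prTimeout time.Duration, prStartTime, now time.Time) time.Duration {
	if remaining := prTimeout - now.Sub(prStartTime); remaining > 0 {
		return remaining
	}
	// Budget already spent: hand the TaskRun a tiny timeout so it fails promptly.
	return time.Second
}
```

In the failing run, the TaskRun was created essentially at the original start time, so it got roughly the full 5s budget and duly reported TaskRunTimeout at 20:35:11.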
When we transition the PipelineRun to "Failed" we skip this check: https://github.com/tektoncd/pipeline/blob/3185d055fa0959830d3c986407a1e6e58f074003/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L206-L218
Digging into why: the check looks at the StartTime in the PipelineRun's status, and in the failing cases that StartTime is less than 5s before the completion time(!):
```
  PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
  	StartTime:      s"2020-10-25 20:35:07 +0000 UTC",
- 	CompletionTime: nil,
+ 	CompletionTime: s"2020-10-25 20:35:11.874768406 +0000 UTC m=+601.330431507",
  	TaskRuns:       map[string]*v1beta1.PipelineRunTaskRunStatus{
```
The same status update shows us propagating the following for the TaskRun:
```
  TaskRunStatusFields: v1beta1.TaskRunStatusFields{
  	PodName:        "pipeline-run-timeout-qaakhkfe-foo-zzlbf-pod-577ph",
  	StartTime:      s"2020-10-25 20:35:06 +0000 UTC",
- 	CompletionTime: nil,
+ 	CompletionTime: s"2020-10-25 20:35:11 +0000 UTC",
```
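Plugging the timestamps from these diffs into the two checks makes the race obvious: the TaskRun measures the 5s budget from its own StartTime, while the PipelineRun measures it from its (re-initialized) StartTime. The comparison below is a simplified stand-in for the real check in pipelinerunstate.go, with sub-second precision dropped:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timeout := 5 * time.Second

	// Timestamps lifted from the status diffs above.
	taskRunStart := time.Date(2020, 10, 25, 20, 35, 6, 0, time.UTC) // TaskRun StartTime
	prStart := time.Date(2020, 10, 25, 20, 35, 7, 0, time.UTC)      // PipelineRun StartTime (reset!)
	done := time.Date(2020, 10, 25, 20, 35, 11, 0, time.UTC)        // CompletionTime of both

	// The TaskRun sees >= 5s elapsed, so it reports TaskRunTimeout.
	fmt.Println("TaskRun elapsed:", done.Sub(taskRunStart), "timed out:", done.Sub(taskRunStart) >= timeout)

	// The PipelineRun sees < 5s elapsed from its reset StartTime, so it reports plain Failed.
	fmt.Println("PipelineRun elapsed:", done.Sub(prStart), "timed out:", done.Sub(prStart) >= timeout)
}
```

This prints an elapsed time of 5s (timed out) for the TaskRun and only 4s (not timed out) for the PipelineRun.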
Generally this means that any PipelineRun specifying a Timeout may simply show up as Failed due to this race.
My $0.02 on the appropriate "fix": the PipelineRun's StartTime should never be later than the StartTime of any of its constituent TaskRuns. That invariant would guarantee that a copied-down timeout never manifests this way.
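A hypothetical sketch of that invariant (purely illustrative, not a proposed patch; the helper name is made up):

```go
package example

import (
	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// clampPipelineRunStartTime enforces the invariant suggested above: a PipelineRun's
// StartTime is never later than the StartTime of any of its TaskRuns, so a timeout
// measured from the PipelineRun can never under-count elapsed time. Hypothetical
// helper, not actual Tekton code.
func clampPipelineRunStartTime(pr *v1beta1.PipelineRun, taskRuns []*v1beta1.TaskRun) {
	for _, tr := range taskRuns {
		if tr.Status.StartTime == nil {
			continue
		}
		if pr.Status.StartTime == nil || tr.Status.StartTime.Time.Before(pr.Status.StartTime.Time) {
			pr.Status.StartTime = tr.Status.StartTime.DeepCopy()
		}
	}
}
```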
cc @vdemeester @afrittoli @ImJasonH @bobcatfish for thoughts 🙏
Sorry, I should mention that the first bit of YAML is actually a dump from a prior run, lifted from the older bug that this issue replaces, so don't try to reconcile it with the later diffs.
The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58
I think this stems from stale informer caches combined with the retry logic we have for updating status.
On a subsequent reconciliation of the PipelineRun (+1s), working from a stale, uninitialized copy, we re-run InitializeConditions, which resets StartTime; but a prior reconciliation had already created the TaskRuns (which is likely what triggered this reconciliation in the first place).
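To make that concrete, the top of each reconciliation roughly does something like the following (a simplified sketch, not the real reconciler code): when the copy read from the informer cache is stale enough to still look unstarted, the StartTime gets stamped again, later.

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// reconcileSketch approximates the shape of the reconciler's start-up logic; it is not
// the actual Tekton code. On a stale copy of the PipelineRun, the "looks unstarted"
// branch runs a second time and pushes StartTime forward, even though an earlier
// reconciliation already stamped it and already created the TaskRuns.
func reconcileSketch(pr *v1beta1.PipelineRun, now metav1.Time) {
	if pr.Status.StartTime == nil || pr.Status.StartTime.IsZero() { // looks unstarted
		// (Re)initialize conditions and stamp the StartTime.
		pr.Status.StartTime = &now
		// ... mark the Succeeded condition Unknown/Running ...
	}
	// ... resolve the Pipeline, create or monitor TaskRuns, then update status ...
}
```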
The diff for this status update looks like:
```
kubelogs.go:197: D 20:35:07.329 controlplane-2 [pipelinerun/reconciler.go:322] [arendelle-px7ff/pipeline-run-timeout-qaakhkfe] Updating status with: v1beta1.PipelineRunStatus{
  	Status: v1beta1.Status{
  		ObservedGeneration: 0,
- 		Conditions:         nil,
+ 		Conditions: v1beta1.Conditions{
+ 			{
+ 				Type:               "Succeeded",
+ 				Status:             "Unknown",
+ 				LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-25 20:35:07.188165477 +0"...},
+ 				Reason:             "Running",
+ 				Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
+ 			},
+ 		},
  		Annotations: nil,
  	},
  	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
- 		StartTime:      nil,
+ 		StartTime:      s"2020-10-25 20:35:07.055779791 +0000 UTC m=+596.511442892",
  		CompletionTime: nil,
- 		TaskRuns:       nil,
+ 		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
+ 			"pipeline-run-timeout-qaakhkfe-foo-zzlbf": &{PipelineTaskName: "foo", Status: &v1beta1.TaskRunStatus{}},
+ 		},
  		PipelineResults: nil,
- 		PipelineSpec:    nil,
+ 		PipelineSpec:    &v1beta1.PipelineSpec{Tasks: []v1beta1.PipelineTask{{Name: "foo", TaskRef: &v1beta1.TaskRef{...}}}},
  		SkippedTasks: nil,
  	},
  }
```
Note the empty StartTime in the base, despite this in a prior update:
```
  PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
- 	StartTime: nil,
+ 	StartTime: s"2020-10-25 20:35:06.759884985 +0000 UTC m=+596.215548086",
```
TestPipelineRunTimeout test is still flaky
fixed in https://github.com/tektoncd/pipeline/pull/5134 🤞🏾
thanks @abayer 🎉
Expected Behavior
TestPipelineRunTimeout passes consistently, with the expected timeout failure mode (PipelineRunTimeout) properly observed.
Actual Behavior
TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).
Steps to Reproduce the Problem
Run the test a lot.
I will post my analysis here shortly, as I believe I know WHY this is happening.