Watching this thread closely. We did an upgrade from v0.10 to v0.17 a couple of weeks ago (had to excise PipelineResources) and performance has been all over the place since then. I'm thinking about downgrading at this point. I've looked into the two HA solutions in this issue but the results have not been great -- sometimes we see duplicated work or random task restarts when the task should be finishing with the StatefulSet approach, and hardly any performance increase with the documented approach. Our pipelines go from 2 concurrent tasks -> 6 concurrent tasks -> 96 concurrent tasks, and at any given time there are 15 running on the cluster. Each pipeline gets its own namespace, if that makes any difference.
If I can provide any data or notes in this area I am eager to help.
Uhm, pipeline 40 started quickly on my kind cluster, but I found an issue in the pipeline definition:
```yaml
status:
  completionTime: "2020-11-14T10:05:06Z"
  conditions:
  - lastTransitionTime: "2020-11-14T10:05:06Z"
    message: 'unable to find result referenced by param "a" in "t2": Could not find
      result with name a for task run t1'
    reason: InvalidTaskResultReference
    status: "False"
    type: Succeeded
```
@skaegi I think you always provision the pipeline and task definitions along with the pipeline run, is that correct?
Fixed the task as follows:
```yaml
steps:
  - name: echo
    image: alpine
    script: |
      echo "a-result" > $(results.a.path)
      echo "b-result" > $(results.b.path)
      echo "c-result" > $(results.c.path)
      echo "d-result" > $(results.d.path)
      echo "e-result" > $(results.e.path)
      echo "f-result" > $(results.f.path)
      echo "g-result" > $(results.g.path)
      echo "h-result" > $(results.h.path)
      echo "i-result" > $(results.i.path)
      echo "j-result" > $(results.j.path)
```
The whole pipeline ran quite quickly:
```
NAME                    STARTED         DURATION   STATUS
pipeline-40-run-92kf8   2 minutes ago   1 minute   Succeeded
```
With pipeline 50 I already see evident signs of degradation: the first TaskRun starts after about 30s, then it takes a couple of minutes before the rest start.
We definitely get stuck in dag.GetSchedulable. I'll try to build a unit test that matches the pipe60 pipeline and see if I can reproduce the issue there. This is the last line in the controller logs before the waiting starts:
```
{"level":"info","ts":"2020-11-14T16:23:35.940Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-pipelinerun.Reconciler","caller":"resources/pipelinerunstate.go:197","msg":"#3521 About to invoke dag.GetSchedulable","commit":"cdc171b","knative.dev/traceid":"42f40bcb-907c-49cb-8aa4-b184cc97b7a7","knative.dev/key":"default/pipeline-60-run-mhgkq"}
```
I added a simple unit test that emulates pipe60 (https://gist.github.com/afrittoli/be4dae78d46cdeb8cfd4a749839b66bb), but that runs fine: it takes about 1s overall to run dag.GetSchedulable twice, once with an empty execution history and once after t1 has completed.
successfulOrSkippedDAGTasks is the function where all the time is spent: https://github.com/tektoncd/pipeline/blob/406a21e5fb7524c2ca4944864cdcad8338f8054c/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L314-L324
The loop in successfulOrSkippedDAGTasks calls t.Skip(facts) for every pipeline task. The Skip function (https://github.com/tektoncd/pipeline/blob/cdc171b906cea29c894098924e3125bc27512b20/pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go#L136-L152) invokes parentTasksSkip, which invokes Skip recursively. On my local cluster I see that each iteration of the loop in successfulOrSkippedDAGTasks can take up to 2 minutes.
Since we check for if t.IsSuccessful() || t.Skip(facts), once tasks start to run the number of times that we have to invoke Skip decreases, unlocking the execution.
Even though recursion fits with the high CPU usage and long execution times that we see, it's still not clear to me why it takes quite this long; the algorithm can surely be improved, but there may also be a bug in there that I'm still missing which would fully explain this.
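To make that concrete, here's a toy model of what the loop is doing (standalone Go, not Tekton code; the layered graph shape, the naiveSkip function and all names are assumptions made up for illustration). Every task in one layer depends on every task in the previous layer, and an unmemoized skip check re-walks all parents on every call, so the number of evaluations grows geometrically with the number of layers even though the task count only grows linearly:

```go
// skipblowup.go — toy model (not Tekton code) of an unmemoized,
// parent-walking skip check evaluated once per pipeline task.
package main

import "fmt"

type task struct {
	name    string
	parents []*task
}

var calls int

// naiveSkip mirrors the Skip -> parentTasksSkip -> Skip pattern: the answer
// is recomputed from scratch on every call, with no caching.
func naiveSkip(t *task) bool {
	calls++
	for _, p := range t.parents {
		if naiveSkip(p) {
			return true
		}
	}
	return false // nothing in this toy graph is actually skipped
}

// buildLayered builds `depth` layers of `width` tasks where every task
// depends on every task in the previous layer.
func buildLayered(width, depth int) []*task {
	var all, prev []*task
	for d := 0; d < depth; d++ {
		var layer []*task
		for w := 0; w < width; w++ {
			t := &task{name: fmt.Sprintf("t-%d-%d", d, w), parents: prev}
			layer = append(layer, t)
			all = append(all, t)
		}
		prev = layer
	}
	return all
}

func main() {
	for _, depth := range []int{4, 6, 8, 10} {
		tasks := buildLayered(5, depth)
		calls = 0
		// successfulOrSkippedDAGTasks-style loop: evaluate skip for every task.
		for _, t := range tasks {
			naiveSkip(t)
		}
		fmt.Printf("%2d tasks -> %10d skip evaluations\n", len(tasks), calls)
	}
}
```

With these sizes it goes from roughly 970 evaluations for 20 tasks to over 15 million for 50 tasks, which is the same kind of cliff as the 40/50/60-task numbers reported for the synthetic pipelines.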
It looks like it's high time I resumed work on this refactor (https://github.com/tektoncd/pipeline/pull/2821), which stores the "won't run" status in the state and avoids the continuous loop over the entire pipeline.
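For comparison, here is the caching idea in a minimal sketch (same toy model as above; this is not the code in the PR, which stores the computed status in the pipeline run state rather than on a toy struct): once a task's skip status has been computed, keep it and reuse it instead of re-walking the parents, so every task is evaluated exactly once.

```go
// memoskip.go — toy model of caching the skip answer per task.
package main

import "fmt"

type task struct {
	name    string
	parents []*task
	skipped *bool // cached answer; nil means "not computed yet"
}

var calls int

func memoSkip(t *task) bool {
	if t.skipped != nil {
		return *t.skipped
	}
	calls++
	result := false
	for _, p := range t.parents {
		if memoSkip(p) {
			result = true
			break
		}
	}
	t.skipped = &result
	return result
}

// buildLayered builds the same layered graph as the previous sketch.
func buildLayered(width, depth int) []*task {
	var all, prev []*task
	for d := 0; d < depth; d++ {
		var layer []*task
		for w := 0; w < width; w++ {
			t := &task{name: fmt.Sprintf("t-%d-%d", d, w), parents: prev}
			layer = append(layer, t)
			all = append(all, t)
		}
		prev = layer
	}
	return all
}

func main() {
	tasks := buildLayered(5, 10) // 50 tasks, same shape as before
	for _, t := range tasks {
		memoSkip(t)
	}
	fmt.Printf("%d tasks -> %d skip evaluations\n", len(tasks), calls)
}
```

On the same 50-task layered graph this does 50 evaluations instead of ~15 million.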
In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.
It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?
@ImJasonH we originally found the problem in 0.16.3, but FWIW we're in the process of moving to 0.18.x, so at least for us patching further back is not needed.
> In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.
That is already done as a unit test that reproduces the issue (also in the PR). When I added the unit test it would time out after 30s; with the PR it passes. On my laptop it manages a pipeline with 120 tasks in 30s, so there's still room for improvement.
> It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?
Perhaps we could add the pipeline60 as an e2e test too. I would not add it as a YAML test because it's easier to maintain when it's generated via code. It takes a couple of minutes to run on my kind cluster, so it should be fine for E2E tests.
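Something along these lines could generate it (just a sketch: the shape below is a plain result chain and names like echo-result / pipe-N are made up, so it would need to be adjusted to match the real pipe40/50/60 topology; here it emits YAML, but the same loop could build the v1beta1 objects directly in the test):

```go
// genpipeline.go — sketch of generating the synthetic test pipeline from code
// instead of maintaining a large YAML file. One task publishes a result and
// every pipeline task after the first consumes the previous task's result,
// which is what creates the result/param resolution work for the controller.
package main

import (
	"fmt"
	"strings"
)

const taskYAML = `apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: echo-result
spec:
  params:
    - name: input
      type: string
      default: "start"
  results:
    - name: out
      description: a result consumed by the next pipeline task
  steps:
    - name: echo
      image: alpine
      script: |
        echo "$(params.input)-result" > $(results.out.path)
`

// pipelineYAML emits a Pipeline with n tasks chained via results.
func pipelineYAML(n int) string {
	var b strings.Builder
	b.WriteString("apiVersion: tekton.dev/v1beta1\nkind: Pipeline\nmetadata:\n")
	fmt.Fprintf(&b, "  name: pipe-%d\nspec:\n  tasks:\n", n)
	for i := 1; i <= n; i++ {
		fmt.Fprintf(&b, "    - name: t%d\n      taskRef:\n        name: echo-result\n", i)
		if i > 1 {
			// Referencing the previous task's result also implies the ordering.
			fmt.Fprintf(&b, "      params:\n        - name: input\n          value: $(tasks.t%d.results.out)\n", i-1)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(taskYAML)
	fmt.Println("---")
	fmt.Print(pipelineYAML(60)) // size of the synthetic pipeline
}
```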
We are hitting a scalability problem that pins the controller in our clusters when we have a complex pipeline with close to 50 tasks that makes heavy use of params and results. I'll enclose a synthetic test case, but this is a real-world case that we are seeing.
What we see is that the controller is CPU pinned before creating the first Task pod. In my cluster what I'm seeing is a very sudden degradation in start time for the synthetic test case when I run a pipeline that contains:
1) 40 Tasks -- 10 seconds to start
2) 50 Tasks -- 1 minute to start
3) 60 Tasks -- 41 minutes to start
... that feels like we have an algorithm that is out of whack.
The following synthetic tests are standalone and don't use a workspace or anything fancy and might be good in our tests later on...
pipe40p.yaml.txt pipe50p.yaml.txt pipe60p.yaml.txt