tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

Pinning CPU with a pipeline with 50 tasks and use of params/results chain #3521

Closed. skaegi closed this issue 4 years ago

skaegi commented 4 years ago

We are hitting a scalability problem that pins the controller in our clusters when we have a complex pipeline with close to 50 tasks that makes heavy use of params and results. I'll attach a synthetic test case, but this is a real-world case that we are seeing in our clusters.

What we see is that the controller is CPU-pinned before creating the first Task pod. In my cluster I'm seeing a very sudden degradation in start time for the synthetic test case when I run a pipeline that contains:

1) 40 Tasks -- 10 seconds to start
2) 50 Tasks -- 1 minute to start
3) 60 Tasks -- 41 minutes to start

...that feels like we have an algorithm that is out of whack.

The following synthetic tests are standalone and don't use a workspace or anything fancy; they might be useful in our test suite later on...

pipe40p.yaml.txt pipe50p.yaml.txt pipe60p.yaml.txt
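
For reference, the result-to-param chaining in these synthetic pipelines looks roughly like this (a minimal sketch with illustrative task and result names, not the exact attached files):

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipe-sketch
spec:
  tasks:
  - name: t1
    taskRef:
      name: produce-results   # a task that declares and writes results a..j
  - name: t2
    taskRef:
      name: produce-results
    params:
    - name: a
      value: $(tasks.t1.results.a)   # consuming a result creates a DAG edge t1 -> t2
  # ...and so on up to t40/t50/t60, each consuming results from earlier tasks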

dustym commented 4 years ago

Watching this thread closely. We upgraded from v0.10 to v0.17 a couple of weeks ago (and had to excise PipelineResources), and performance has been all over the place since then; I'm thinking about downgrading at this point. I've looked into the two HA solutions in this issue, but the results have not been great: with the StatefulSet approach we sometimes see duplicated work or random task restarts when the task should be finishing, and with the documented approach we see hardly any performance increase. Our pipelines go from 2 concurrent tasks -> 6 concurrent tasks -> 96 concurrent tasks, and at any given time there are 15 running on the cluster. Each pipeline gets its own namespace, if that makes any difference.

If I can provide any data or notes in this area I am eager to help.

afrittoli commented 4 years ago

Uhm, pipeline 40 started quickly on my kind cluster, but I found an issue in the pipeline definition:

status:
  completionTime: "2020-11-14T10:05:06Z"
  conditions:
  - lastTransitionTime: "2020-11-14T10:05:06Z"
    message: 'unable to find result referenced by param "a" in "t2": Could not find
      result with name a for task run t1'
    reason: InvalidTaskResultReference
    status: "False"
    type: Succeeded
afrittoli commented 4 years ago

@skaegi I think you always provision the pipeline and task definitions along with the pipeline run, is that correct?

afrittoli commented 4 years ago

Fixed the task as follows:

  steps:
  - name: echo
    image: alpine
    script: |
      echo "a-result" > $(results.a.path)
      echo "b-result" > $(results.b.path)
      echo "c-result" > $(results.c.path)
      echo "d-result" > $(results.d.path)
      echo "e-result" > $(results.e.path)
      echo "f-result" > $(results.f.path)
      echo "g-result" > $(results.g.path)
      echo "h-result" > $(results.h.path)
      echo "i-result" > $(results.i.path)
      echo "j-result" > $(results.j.path)

The whole pipeline ran quite quickly:

NAME                    STARTED         DURATION   STATUS
pipeline-40-run-92kf8   2 minutes ago   1 minute   Succeeded
afrittoli commented 4 years ago

With pipeline 50 I already see clear signs of degradation: about 30s before the first TaskRun is created, then a couple of minutes before the rest follow.

afrittoli commented 4 years ago

We definitely get stuck in dag.GetSchedulable. I'll try to build a unit test that matches the pipe60 pipeline and see if I can reproduce the issue there. This is the last line in the controller logs before the waiting starts...

{"level":"info","ts":"2020-11-14T16:23:35.940Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-pipelinerun.Reconciler","caller":"resources/pipelinerunstate.go:197","msg":"#3521 About to invoke dag.GetSchedulable","commit":"cdc171b","knative.dev/traceid":"42f40bcb-907c-49cb-8aa4-b184cc97b7a7","knative.dev/key":"default/pipeline-60-run-mhgkq"}
afrittoli commented 4 years ago

I added a simple unit test that emulates pipe60, but it runs fine: it takes about 1s overall to run dag.GetSchedulable twice, once with an empty execution history and once after t1: https://gist.github.com/afrittoli/be4dae78d46cdeb8cfd4a749839b66bb

afrittoli commented 4 years ago

successfulOrSkippedDAGTasks is the function where all the time is spent: https://github.com/tektoncd/pipeline/blob/406a21e5fb7524c2ca4944864cdcad8338f8054c/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L314-L324

The loop in successfulOrSkippedDAGTasks calls t.Skip(facts) for every pipeline task. The Skip function https://github.com/tektoncd/pipeline/blob/cdc171b906cea29c894098924e3125bc27512b20/pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go#L136-L152 invokes parentTasksSkip, which invokes Skip recursively.

On my local cluster I see that a single iteration of the loop in successfulOrSkippedDAGTasks can take up to 2 minutes. Since we check if t.IsSuccessful() || t.Skip(facts), once tasks start to run the number of times we have to invoke Skip decreases, which eventually unblocks execution.

Even though recursion fits with the high CPU usage and long execution times that we see, it's still not clear to me why it would take this long; the algorithm can surely be improved, but there may also be a bug in there that I'm still missing which would fully explain it.
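
To give a sense of the shape of the problem, here is a simplified sketch (toy types, not the actual Tekton code) of a parent-walking skip check with no caching: every call re-walks the full ancestry, so on a DAG whose tasks share many ancestors the number of recursive calls grows with the number of paths through the graph rather than with the number of tasks.

// Toy model of the recursive skip check; parents maps a task to the
// tasks it depends on (illustrative, not the real Tekton types).
type parents map[string][]string

// skip reports whether a task should be skipped because it, or any of
// its ancestors, is marked as skipped. Without memoization every call
// re-visits shared ancestors once per path that reaches them.
func skip(p parents, task string, skipped map[string]bool) bool {
	if skipped[task] {
		return true
	}
	for _, parent := range p[task] {
		if skip(p, parent, skipped) { // recursive parent walk, no caching
			return true
		}
	}
	return false
}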

It looks like it's high time I resumed work on this refactor (https://github.com/tektoncd/pipeline/pull/2821), which stores the "won't run" status in the state and avoids the continuous loop over the entire pipeline.
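
Roughly, the idea of that refactor in terms of the toy model above (a hedged sketch, not the actual code in the PR): compute the skip status once per task, cache it, and reuse it, so each reconcile walks every task's ancestry at most once.

// Memoized variant: each task's status is computed once and cached, so
// the whole pass is linear in the number of tasks plus edges.
func skipCached(p parents, task string, skipped, cache map[string]bool) bool {
	if v, ok := cache[task]; ok {
		return v
	}
	result := skipped[task]
	if !result {
		for _, parent := range p[task] {
			if skipCached(p, parent, skipped, cache) {
				result = true
				break
			}
		}
	}
	cache[task] = result
	return result
}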

imjasonh commented 4 years ago

In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.

It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?

skaegi commented 4 years ago

@ImJasonH we originally found the problem in 0.16.3, but FWIW we're in the process of moving to 0.18.x, so at least for us patching further back is not needed.

afrittoli commented 4 years ago

> In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.

That is already done: a unit test that reproduces the issue is included in the PR. Without the fix the unit test would time out after 30s; with the PR it passes. On my laptop it handles a pipeline with 120 tasks in 30s, so there's still room for improvement.

> It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?

afrittoli commented 4 years ago

Perhaps we could add the pipeline60 as an e2e test too. I would not add it as a YAML test because it's easier to maintain when it's generated via code. It takes a couple of minutes to run on my kind cluster, so it should be fine for E2E tests.
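
A rough sketch of what generating such a pipeline in Go could look like (illustrative only: it assumes the v1beta1 API types and a produce-results Task like the one above, and is not the helper actually used in the repo):

package generate

import (
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// chainedPipeline builds an n-task Pipeline where every task after the
// first consumes a result of the previous task, reproducing the
// params/results chaining from the synthetic test case.
func chainedPipeline(name string, n int) *v1beta1.Pipeline {
	tasks := make([]v1beta1.PipelineTask, 0, n)
	for i := 1; i <= n; i++ {
		pt := v1beta1.PipelineTask{
			Name:    fmt.Sprintf("t%d", i),
			TaskRef: &v1beta1.TaskRef{Name: "produce-results"},
		}
		if i > 1 {
			// Referencing a result of the previous task creates the DAG edge.
			pt.Params = []v1beta1.Param{{
				Name: "a",
				Value: v1beta1.ArrayOrString{
					Type:      v1beta1.ParamTypeString,
					StringVal: fmt.Sprintf("$(tasks.t%d.results.a)", i-1),
				},
			}}
		}
		tasks = append(tasks, pt)
	}
	return &v1beta1.Pipeline{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec:       v1beta1.PipelineSpec{Tasks: tasks},
	}
}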