Start measuring Tekton Pipelines performance

bobcatfish commented 5 years ago

Expected Behavior

We should be measuring performance for Pipelines. This task includes both adding the actual measurement mechanism and designing what exactly we want to measure.

Some ideas for measurement:

Null Task / null Pipeline (i.e. it doesn't actually do anything)
Null Tasks that have linked inputs and outputs
Stress testing (make recommendations about cluster size)
?
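As a rough illustration of the null Task idea, here is a minimal sketch that derives overhead from timestamps on a finished TaskRun. It assumes the object carries metadata.creationTimestamp, status.startTime and status.completionTime as second-precision UTC timestamps; the function name and the sample object are hypothetical, not part of Tekton.

```python
from datetime import datetime, timezone

# Assumption: timestamps look like "2019-08-12T10:00:00Z" (UTC, second precision).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_time(value):
    """Parse a timestamp string into a timezone-aware datetime."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def null_task_overhead(taskrun):
    """Return (seconds from creation to start, seconds from creation to completion)."""
    created = parse_time(taskrun["metadata"]["creationTimestamp"])
    started = parse_time(taskrun["status"]["startTime"])
    completed = parse_time(taskrun["status"]["completionTime"])
    return (
        (started - created).total_seconds(),
        (completed - created).total_seconds(),
    )


# Hypothetical TaskRun, trimmed to the fields used above.
sample_taskrun = {
    "metadata": {"creationTimestamp": "2019-08-12T10:00:00Z"},
    "status": {
        "startTime": "2019-08-12T10:00:02Z",
        "completionTime": "2019-08-12T10:00:09Z",
    },
}

wait_s, total_s = null_task_overhead(sample_taskrun)
print(f"waited {wait_s}s to start, {total_s}s end to end")
```

Since a null Task does no real work, nearly all of the end-to-end time in such a measurement would be overhead from scheduling and the controller.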

Requirements

We should have a set of "happy SLOs" defined for Task and Pipeline execution
We should be regularly measuring these SLOs
Maintainers should be made aware when we are in violation of these SLOs
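To make the requirements above concrete, here is a minimal sketch of an SLO check. It assumes each SLO is expressed as "the Nth percentile of run duration stays under a limit"; the SLO names, percentiles, limits and sample durations are all hypothetical placeholders, not agreed values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slos(durations_by_kind, slos):
    """Return one message per violated SLO; an empty list means all SLOs are met."""
    violations = []
    for kind, (pct, limit_s) in slos.items():
        observed = percentile(durations_by_kind[kind], pct)
        if observed > limit_s:
            violations.append(
                f"{kind}: p{pct} = {observed:.1f}s exceeds {limit_s:.1f}s"
            )
    return violations


# Hypothetical SLOs: (percentile, limit in seconds).
example_slos = {"null-task": (90, 10.0), "null-pipeline": (90, 30.0)}

# Hypothetical measured durations in seconds.
example_durations = {
    "null-task": [4, 5, 6, 5, 7, 5, 6, 4, 12, 5],
    "null-pipeline": [20, 22, 25, 21, 35, 24, 23, 36, 22, 21],
}

for message in check_slos(example_durations, example_slos):
    print(message)
```

Running a check like this on a schedule and posting any returned messages somewhere maintainers already look would cover the "regularly measuring" and "made aware" requirements.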

Actual Behavior

We do not measure or track this.

Additional Info

Other Knative projects use this: https://testgrid-dot-knative-tests.appspot.com/knative-build#latency
Note that the metrics collector being used by these projects is global and our end to end tests run in parallel, so the metrics will not work as is

pradeepitm12 commented 5 years ago

Hello @bobcatfish, I need your thoughts on this. I see two options:

1- A service outside of Tekton that watches Tekton objects and exposes them to Prometheus.
2- Introduce an endpoint in the tekton pipeline itself to expose all the metrics to Prometheus.

bobcatfish commented 5 years ago

My gut feeling is that I'd lean more toward exposing the metrics from Pipelines itself:

2- Introduce an endpoint in the tekton pipeline itself to expose all the metrics to Prometheus.

Question: I'm not super familiar with Prometheus, how vital would it be to making metrics usable? Could we simply emit the metrics, and allow the user to provide their own metrics gathering mechanism (which could be Prometheus but could be something else), or would it make more sense for us to include Prometheus out of the box? (I'm very sensitive to adding new dependencies, esp. since I'm under the impression that managing Prometheus is a job in itself, but maybe I'm wrong!)
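For the "simply emit the metrics" option, one possibility is to render metrics in the Prometheus text exposition format, which is plain text that a collector can scrape over HTTP without the emitter depending on Prometheus itself. Below is a minimal sketch using only the standard library; the metric name, help text and label are hypothetical and are not metrics that Tekton defines.

```python
def escape_label(value):
    """Escape backslash, double quote and newline, as the text format requires for label values."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric; samples is a list of (labels dict, numeric value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: number of TaskRuns currently running, per namespace.
body = render_gauge(
    "example_running_taskruns",
    "Number of TaskRuns currently running.",
    [({"namespace": "default"}, 3)],
)
print(body, end="")
```

Serving this text from an HTTP endpoint would leave the choice of collector to the user, at the cost of maintaining the rendering ourselves instead of using a client library.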

Another option, which I think is a variation on your first suggestion @pradeepitm12 : 3 - (For now) only measure the performance in tests we write specifically for this purpose (i.e. we don't expose anything new for users of Tekton Pipelines, but we start doing our own measurements)

rawlingsj commented 5 years ago

+1 we're looking at the same thing and just started looking at prometheus too, hopefully we can help each other out here.

:chart_with_upwards_trend:

bobcatfish commented 5 years ago

+1 we're looking at the same thing and just started looking at prometheus too, hopefully we can help each other out here.

Maybe the first thing to do would be to identify the metrics we're interested in? I'm not super familiar with prometheus but I would think before we want to monitor the metrics, we'd want to figure out what needs monitoring (maybe there's a Jenkins/Jenkins X precedent we can draw on :D?)

ghost commented 4 years ago

We had our first meeting regarding observability, specifically metrics, today and work is now underway. There are a couple of other issues that overlap in theme with this one. I am linking them together here for us to review later and figure out which to keep and which to close.

Metrics Design Doc

Notes from the initial metrics meeting

tekton-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

bobcatfish commented 3 years ago

We haven't worked on this lately but it is an item in our roadmap and I think we should keep it open.

/lifecycle frozen

bobcatfish commented 3 years ago

I want to start gathering some requirements around this and get it moving :D

bobcatfish commented 3 years ago

https://github.com/tektoncd/pipeline/issues/3521 has some use cases that we might be able to use

mengjieli0726 commented 3 years ago

@bobcatfish, is there a Tekton performance white paper? As of now, how many PipelineRuns or runs can we support on a mid-sized cluster (for example, 1 master + 1 compute node)? The node spec: 8 cores + 64 GB memory + 250 GB disk.
