tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0

taskruns_pod_latency reports pod label on task/namespace metric level #7553

Open zhouhaibing089 opened 6 months ago

zhouhaibing089 commented 6 months ago

We run Tekton version 0.44.0 and have the following options configured in the config-observability ConfigMap:

metrics.taskrun.level: "task"
metrics.taskrun.duration-type: "histogram"
metrics.pipelinerun.level: "pipeline"
metrics.pipelinerun.duration-type: "histogram"

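For context, these keys live in the config-observability ConfigMap in the controller's namespace. A minimal sketch, assuming a default installation in the tekton-pipelines namespace:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines   # default installation namespace; adjust if yours differs
data:
  metrics.taskrun.level: "task"
  metrics.taskrun.duration-type: "histogram"
  metrics.pipelinerun.level: "pipeline"
  metrics.pipelinerun.duration-type: "histogram"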
Here is an example of metrics series copied from our running instance:

tekton_pipelines_controller_taskruns_pod_latency{namespace="<ns>",pod="<pod-1>",task="anonymous"} 0
tekton_pipelines_controller_taskruns_pod_latency{namespace="<ns>",pod="<pod-2>",task="anonymous"} 0
tekton_pipelines_controller_taskruns_pod_latency{namespace="<ns>",pod="<pod-3>",task="anonymous"} 0
tekton_pipelines_controller_taskruns_pod_latency{namespace="<ns>",pod="<pod-4>",task="anonymous"} 0

If I understand correctly, a level of task should eliminate the taskrun label, and since the pod label has a 1:1 mapping with taskrun, I'd expect the pod label to be dropped as well. As is, this produces a rapidly growing number of metric series.

For reference, https://github.com/tektoncd/community/blob/main/teps/0073-simplify-metrics.md.
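Until the label is dropped at the source, one scrape-side workaround is to drop the high-cardinality series in Prometheus with metric_relabel_configs. This is a sketch against an assumed scrape config (the job name is hypothetical), and it discards the metric entirely rather than just the pod label:

scrape_configs:
  - job_name: tekton-pipelines-controller   # hypothetical job name for the controller's /metrics endpoint
    metric_relabel_configs:
      # Drop the per-pod latency series to cap cardinality on the Prometheus side.
      - source_labels: [__name__]
        regex: tekton_pipelines_controller_taskruns_pod_latency
        action: drop

Note that this only reduces storage and query cost in Prometheus; the controller still pays the cost of recording and exporting the series.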

Pangjiping commented 4 months ago

We have also encountered this issue in our production environment (over 100,000 pods per day): the controller spends a significant amount of time handling Prometheus metrics, leading to severe degradation of its core performance. These unnecessarily high-cardinality metrics should be removed in the future.

Pangjiping commented 4 months ago

Currently, we have to work around this issue with special liveness probes or by restarting the controller periodically.
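As an illustration of the periodic-restart workaround, a minimal sketch is a CronJob that restarts the controller Deployment on a schedule. The names, schedule, and ServiceAccount are assumptions for a default installation (the ServiceAccount needs RBAC permission to patch Deployments):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-tekton-controller          # hypothetical name
  namespace: tekton-pipelines
spec:
  schedule: "0 */6 * * *"                  # every 6 hours; adjust to your churn rate
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: controller-restarter   # hypothetical SA with rollout/patch rights
          restartPolicy: OnFailure
          containers:
            - name: restart
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - -n
                - tekton-pipelines
                - rollout
                - restart
                - deployment/tekton-pipelines-controller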