Open zhouhaibing089 opened 6 months ago
We have also hit this issue in production (over 100,000 pods per day): the controller spends a significant amount of time handling Prometheus metrics, which severely degrades its core performance. These unreasonable metrics should be removed in a future release.
For now, we work around the issue with special liveness probes or by restarting the controller in a timely manner.
We run Tekton version 0.44.0, and we have the following options configured in the `config-observability` ConfigMap:

Here is an example of metrics series copied from our running instance:
If I understand correctly, a level of `task` should eliminate the `taskrun` label, and since the `pod` label is similar to the `taskrun` label (1:1 mapping), I think the `pod` label should be dropped, too. As is, it creates a fast-growing number of metric series.

For reference, see https://github.com/tektoncd/community/blob/main/teps/0073-simplify-metrics.md.
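
To make the cardinality concern concrete, here is a minimal Go sketch that counts how many distinct series a set of observations produces under different label sets. It is not Tekton code: the workload (3 tasks with 1000 runs each), the label values, and the helpers `buildObservations`, `seriesKey`, and `countSeries` are all hypothetical, and the only assumption carried over from the report is the 1:1 mapping between `taskrun` and `pod`.

```go
package main

import (
	"fmt"
	"strings"
)

// buildObservations fabricates a hypothetical workload: `tasks` tasks,
// each executed `runs` times, with exactly one pod per TaskRun (1:1 mapping).
func buildObservations(tasks, runs int) []map[string]string {
	var obs []map[string]string
	for t := 0; t < tasks; t++ {
		for r := 0; r < runs; r++ {
			run := fmt.Sprintf("task-%d-run-%d", t, r)
			obs = append(obs, map[string]string{
				"namespace": "default",
				"task":      fmt.Sprintf("task-%d", t),
				"taskrun":   run,
				"pod":       run + "-pod",
			})
		}
	}
	return obs
}

// seriesKey builds the identity of a series from the labels named in keep.
func seriesKey(labels map[string]string, keep []string) string {
	parts := make([]string, 0, len(keep))
	for _, name := range keep {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

// countSeries returns the number of distinct series the observations
// would create if only the labels in keep were attached to the metric.
func countSeries(obs []map[string]string, keep []string) int {
	seen := make(map[string]struct{})
	for _, labels := range obs {
		seen[seriesKey(labels, keep)] = struct{}{}
	}
	return len(seen)
}

func main() {
	obs := buildObservations(3, 1000)

	// Task-level labels only: one series per task.
	fmt.Println(countSeries(obs, []string{"namespace", "task"})) // 3

	// Keeping the pod label: one series per pod, even without taskrun.
	fmt.Println(countSeries(obs, []string{"namespace", "task", "pod"})) // 3000
}
```

In this sketch, dropping `taskrun` while keeping `pod` leaves the series count tied to the number of runs, which is why removing only one of the two labels would not bound the growth.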