r0bj opened this issue 3 years ago
I can take a look at it if no one else is going to, but it may take me a while.
@ImJasonH will this be suitable as a good first issue?
/assign ywluogg
cc @NavidZ since this relates to metrics
Dropping this here for context. The webhook_request_latencies_bucket metric (and others) is heavily influenced by the labels in question here: https://github.com/knative/pkg/pull/1464/files. Removing the labels in that pull request might help reduce the number of unique webhook_request_latencies_bucket series the webhook has to manage.
Aside from this, I don't know if there's a way to configure the metrics code to purge metrics from the in-memory store after a period of time. That would help too. Most of the time, the in-memory data is sent to a backend like Prometheus, Stackdriver, etc. anyway.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.
/remove-lifecycle stale
@ywluogg are you still looking into this?
@vdemeester looks like this issue would be addressed by TEP-0073: Simplify metrics, right?
Hi @jerop, I'm not looking into this anymore. Please unassign me. Thanks!
/assign @QuanZhang-William
We have a very similar problem. Many metrics have a resource_namespace label. In our case, these namespaces have randomly generated names and live for a short time, which leads to very high cardinality for the resource_namespace label within about a week. That huge number of series results in growing memory consumption.
I agree with @eddie4941 that configuring the metrics code to purge metrics from the in-memory store after a period of time would help.
Based on the discussion in the API WG: /assign @khrm
@pritidesai: GitHub didn't allow me to assign the following users: khrm.
Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/assign @khrm
@khrm: You can't close an active issue/PR unless you authored it or you are a collaborator.
@khrm not only the resource_name label but also the resource_namespace label can contribute to this high-cardinality issue. To fix it for every use case, one would need to purge metrics from the in-memory store after a period of time.
This issue is still relevant. See this comment as well as this.
We have the same issue, too.
I have a proposal for knative/pkg at https://github.com/knative/pkg/pull/2931.
knative/pkg now gives the option to exclude arbitrary tags. I assume the next action item is to bump knative/pkg and customize the webhook options.
@khrm are you still working on this issue? We marked it as "blocking" for a v1 release and we would like to make a v1 release in July. If you are working on it, will you be able to solve this until then? Thank you!
@afrittoli This seems to be resolved. Will be fixed by knative update.
Do you know when this fix is expected to be part of Tekton?
FWIW, besides bumping knative.dev/pkg, it is also necessary to override the default StatReporter options.
Since #7989 is in, I am sending #8033 as a follow-up that actually addresses this issue.
I anticipate there may be asks to make this a configuration option, perhaps off by default, so any pointers would be greatly appreciated.
Expected Behavior

The Prometheus metric webhook_request_latencies_bucket is usable in a real environment: it does not add new data series forever, and Prometheus is able to query it.

Actual Behavior

The Prometheus metric webhook_request_latencies_bucket creates so many data series that it is practically impossible to query in Prometheus (too much data). New series keep being added while the webhook is running, so the number of series grows forever. Restarting the tekton-pipelines-webhook pod resets the number of series and temporarily fixes the issue.

Steps to Reproduce the Problem

Run tekton-pipelines-webhook.

Additional Info

Kubernetes version:
Tekton Pipeline version:
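One way to observe the growth described above (a PromQL sketch, assuming Prometheus is already scraping the webhook) is to graph the number of distinct series for the metric over time; it climbs steadily and drops back after each webhook pod restart:

```promql
# Count of distinct webhook_request_latencies_bucket series currently tracked
count({__name__="webhook_request_latencies_bucket"})
```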