kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

nginx_ingress_controller_orphan_ingress accumulates very many series over time #10242

Open horihel opened 1 year ago

horihel commented 1 year ago

What happened:

[screenshot: Prometheus memory usage growing over time]

We've observed Prometheus gradually using more and more memory over time. After some inspection, we found that nginx_ingress_controller_orphan_ingress constantly exports a very large number of label sets, even for namespaces that have not existed for quite a while.

This cluster might be a bit of a special case, as it constantly creates and destroys namespaces with 10-20 ingresses each in order to run tests.

It's easy to see how this adds up, and the series count does not go down (unless the nginx pods are killed):

[screenshot: number of nginx_ingress_controller_orphan_ingress series only ever increasing]

What you expected to happen:

If I understand correctly, label sets are usually kept on /metrics, but in this case (and maybe others) it might be worth considering no longer exporting the series once the ingress has been deleted.
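
For illustration, here is a minimal, self-contained sketch (not the controller's actual code) of what that could look like with client_golang: `DeletePartialMatch` removes every series of a metric vector whose labels contain a given subset, so a hypothetical cleanup hook could call it when an ingress or namespace is deleted. The metric and label names below mirror the ones discussed in this issue; everything else is an assumption.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// orphanIngress stands in for the controller's orphan-ingress gauge; the label
// names mirror nginx_ingress_controller_orphan_ingress, but this is an
// illustrative sketch, not the controller's real collector.
var orphanIngress = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nginx_ingress_controller_orphan_ingress",
		Help: "Sketch: indicates an ingress referencing missing services/endpoints.",
	},
	[]string{"namespace", "ingress", "type"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(orphanIngress)

	// Series created while the test namespace still exists.
	orphanIngress.WithLabelValues("test-ns-123", "app-a", "no-service").Set(1)
	orphanIngress.WithLabelValues("test-ns-123", "app-b", "no-endpoint").Set(1)

	// Hypothetical cleanup hook: on namespace (or ingress) deletion, drop every
	// series carrying that namespace label, whatever the other labels are.
	removed := orphanIngress.DeletePartialMatch(prometheus.Labels{"namespace": "test-ns-123"})
	fmt.Printf("removed %d stale series\n", removed)
}
```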

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): This is rke2-ingress-nginx as shipped with rke2 v1.25.11+rke2r1.

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-1.6.4-hardened4
  Build:         git-90e1717ce
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.21.4

-------------------------------------------------------------------------------

I'm not sure whether this version is a fork or just vendored by Rancher, but glancing at the code, it looks like orphan series aren't removed in the current mainline 1.8.1 either. I haven't tested that yet, though (sorry).

Kubernetes version (use kubectl version): v1.25.11+rke2r1

Environment: rke2 managed by rancher on vSphere

How to reproduce this issue:

1. Create a namespace with a few ingresses (it doesn't matter whether they are orphaned or not).
2. Delete the namespace.
3. Observe that the metrics for those ingresses stay in /metrics, and the orphan status stays there as well.

Anything else we need to know:

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

longwuyuan commented 1 year ago

Your observation is correct. This happens with Secrets of type TLS as well. It has to do with the design: an object may no longer exist, but the time series still has data from the time the object did exist.

There are not many resources to work on this issue. If you want to submit a PR, it will be welcome. For the similar problem of deleted TLS secrets still showing up in graphs, I think there was a PR to change the PromQL query itself.

/remove-kind bug
/kind feature

horihel commented 1 year ago

I'm afraid Go is out of my league (for now). We've dropped the metric in the Prometheus config and RAM usage has stabilized.

Revolution1 commented 1 year ago

I also encountered the same issue: the large number of series for historical ingresses crashed my Prometheus, which needed more than 30Gi of memory to process them.

I'll try and see if I can put together a PR for this.

Revolution1 commented 1 year ago

related: #8230 @alex123012

Revolution1 commented 1 year ago

Well, I found that orphan_ingress is not the only metric that accumulates over time; nginx_ingress_controller_check_success does as well,

and I expect there are other metrics that behave like this.

We need a full cleanup across the whole registry, like the socket collector does: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L478
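
A rough sketch of what such a registry-wide sweep could look like, loosely modelled on the socket collector's cleanup but not taken from it; the `cleanStaleSeries` helper, the `vecs` map, and the `liveIngresses` set are assumptions made for this example.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanStaleSeries gathers all metric families from the registry, inspects the
// namespace/ingress labels of every series, and deletes series whose ingress is
// no longer live. Only vectors listed in vecs (owned by this collector) are swept.
func cleanStaleSeries(reg *prometheus.Registry, vecs map[string]*prometheus.GaugeVec, liveIngresses map[string]bool) error {
	families, err := reg.Gather()
	if err != nil {
		return err
	}
	for _, mf := range families {
		vec, tracked := vecs[mf.GetName()]
		if !tracked {
			continue // skip metrics this collector does not own
		}
		for _, m := range mf.GetMetric() {
			labels := prometheus.Labels{}
			for _, lp := range m.GetLabel() {
				labels[lp.GetName()] = lp.GetValue()
			}
			if !liveIngresses[labels["namespace"]+"/"+labels["ingress"]] {
				vec.Delete(labels) // drop the whole stale series
			}
		}
	}
	return nil
}

func main() {
	orphan := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "nginx_ingress_controller_orphan_ingress", Help: "sketch"},
		[]string{"namespace", "ingress", "type"},
	)
	reg := prometheus.NewRegistry()
	reg.MustRegister(orphan)

	orphan.WithLabelValues("deleted-ns", "old-ingress", "no-service").Set(1)
	orphan.WithLabelValues("prod", "api", "no-endpoint").Set(1)

	// Only prod/api still exists in the cluster; the deleted-ns series is removed.
	live := map[string]bool{"prod/api": true}
	if err := cleanStaleSeries(reg, map[string]*prometheus.GaugeVec{
		"nginx_ingress_controller_orphan_ingress": orphan,
	}, live); err != nil {
		fmt.Println("sweep failed:", err)
	}
}
```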

github-actions[bot] commented 1 year ago

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or request to prioritize this, please reach out to #ingress-nginx-dev on Kubernetes Slack.

longwuyuan commented 1 month ago

This is true. Another highlighted metric is about TLS certificates and their expiry date. Secrets get deleted but their metric does not get deleted.

However, the project does not have the resources to work on this at this scale, and hence I will close this for now. If a contributor/developer takes this up in the future, this issue can be re-opened. We want to avoid open issues that do not track any action item. All resources are engaged in security & Gateway API priorities.

/Close

k8s-ci-robot commented 1 month ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-2349178624).
longwuyuan commented 1 month ago

/kind bug

frittentheke commented 1 week ago

I observe the same issue as @horihel with the latest controller release.

@longwuyuan could you kindly reopen this issue, or point me to any potential follow-up issue / fix (as per your comment https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-1650768095) that was worked on? I suppose for TLS / certs this is PR https://github.com/kubernetes/ingress-nginx/pull/9706?

Looking at https://github.com/kubernetes/ingress-nginx/blob/a8c62e22b72c68e4a829cc7954ff44b3d7d1dc1c/internal/ingress/metric/collectors/controller.go#L343-L346, there already seems to be plenty of logic for metric removal.

But apparently this does not yet cover the metrics about orphaned resources introduced via https://github.com/kubernetes/ingress-nginx/issues/4763.
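
A hypothetical sketch of how that removal path could be extended to cover the orphan metric as well; the `Controller` struct and `RemoveIngressMetrics` method below are stand-ins for this example, not the names used in controller.go.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Controller is a stand-in for the controller collector; the field and method
// names here are hypothetical, not the ones used in controller.go.
type Controller struct {
	OrphanIngress *prometheus.GaugeVec
}

// RemoveIngressMetrics sketches the extension: when an ingress disappears from
// the cluster, delete every orphan_ingress series belonging to it, regardless
// of the "type" label value.
func (c *Controller) RemoveIngressMetrics(namespace, ingress string) {
	c.OrphanIngress.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   ingress,
	})
}

func main() {
	c := &Controller{OrphanIngress: prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "nginx_ingress_controller_orphan_ingress", Help: "sketch"},
		[]string{"namespace", "ingress", "type"},
	)}
	c.OrphanIngress.WithLabelValues("test-ns", "app", "no-service").Set(1)
	c.RemoveIngressMetrics("test-ns", "app") // the series is gone from the next scrape
}
```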

longwuyuan commented 1 week ago

@frittentheke there are no resources available to work on this. In case someone submits a PR, it may get reviewed depending on how informative the PR description and the details provided are.

frittentheke commented 1 week ago

> @frittentheke there are no resources available to work on this. In case someone submits a PR, it may get reviewed depending on how informative the PR description and the details provided are.

That sounds fair. But would you at least consider reopening this bug / issue? If it's an open issue, it's much more attractive to pick up and tackle ;-)

longwuyuan commented 1 week ago

I request that @horihel reopen it. Thank you for understanding.

horihel commented 1 week ago

I could request that, but I probably cannot contribute anything substantial (except a few workarounds).

longwuyuan commented 1 week ago

The way to reopen is to type /reopen, so thanks for understanding and please decide for yourself. Regards. The stale TLS cert info should also go. I am just curious: at the top right there is a time-range selector, so if someone selects a time range from the past, will the data be nil?

horihel commented 1 week ago

/reopen

Yes, if a metric goes missing, Grafana/Prometheus will revert to "no data" for that specific time range.

Our current workaround is to drop that metric altogether in the Prometheus scrape config and to restart ingress-nginx regularly.

k8s-ci-robot commented 1 week ago

@horihel: Reopened this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-2450093675).