horihel opened this issue 1 year ago
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Your observation is correct. This happens with secrets of type TLS as well. It has to do with the design: an object may no longer exist, but the timeseries still has data from the time that the object did exist.
There are not many resources to work on this issue. If you want to submit a PR, it will be welcome. For the similar problem of deleted TLS secrets being shown in graphs, I think there was a PR to change the PromQL query itself.
/remove-kind bug
/kind feature
I'm afraid Go is out of my league (for now). We've dropped the metric in the Prometheus config and RAM usage has stabilized.
Also encountered the same issue: too many metrics from historical ingresses crashed my Prometheus, which took more than 30Gi of memory to process them.
I'll try and see if I can make a PR about this.
related: #8230 @alex123012
Well, I found that orphan_ingress is not the only metric that accumulates over time: nginx_ingress_controller_check_success does as well, and I expect there are other metrics like this.
We need a total cleanup through the whole registry, like the socket collector does: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L478
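For reference, a rough sketch of what such a registry-wide sweep could look like with prometheus/client_golang. This is illustrative only: the function name, the vecs map, the source of liveIngresses, and the namespace/ingress label names are assumptions, not the controller's actual API.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/klog/v2"
)

// sweepDeadIngressSeries walks every gathered metric family that we track in
// vecs (metric name -> the GaugeVec that produced it) and deletes series whose
// namespace/ingress labels point at an Ingress that no longer exists.
// liveIngresses is keyed by "namespace/name".
func sweepDeadIngressSeries(reg *prometheus.Registry, vecs map[string]*prometheus.GaugeVec, liveIngresses map[string]bool) {
	families, err := reg.Gather()
	if err != nil {
		klog.ErrorS(err, "gathering metrics for stale-series cleanup")
		return
	}
	for _, mf := range families {
		vec, tracked := vecs[mf.GetName()]
		if !tracked {
			continue
		}
		for _, m := range mf.GetMetric() {
			var ns, ing string
			for _, lp := range m.GetLabel() {
				switch lp.GetName() {
				case "namespace":
					ns = lp.GetValue()
				case "ingress":
					ing = lp.GetValue()
				}
			}
			if ns == "" || ing == "" || liveIngresses[ns+"/"+ing] {
				continue
			}
			// The Ingress is gone: remove every series for it, whatever its
			// other labels are. DeletePartialMatch needs client_golang >= v1.12.
			vec.DeletePartialMatch(prometheus.Labels{"namespace": ns, "ingress": ing})
		}
	}
}
```

Hooked into the controller's periodic sync, with the ingress lister providing liveIngresses, something like this would mirror what the socket collector's RemoveMetrics already does for its own metrics.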
This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any question or request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.
This is true. Another highlighted metric is about TLS certificates and their expiry date. Secrets get deleted but their metric does not get deleted.
However, the project does not have the resources to work on this at this scale, and hence I will close this for now. If a contributor takes this up in the future, this issue can be re-opened. We want to avoid open issues that do not track any action item. All resources are engaged in security & Gateway API priorities.
/Close
@longwuyuan: Closing this issue.
/kind bug
I observe the same issue as @horihel with the latest controller release.
@longwuyuan could you kindly reopen this issue, or point me to any follow-up issue / fix (as per your comment https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-1650768095) that was done or worked on? I suppose for TLS / certs this is PR https://github.com/kubernetes/ingress-nginx/pull/9706?
Looking at https://github.com/kubernetes/ingress-nginx/blob/a8c62e22b72c68e4a829cc7954ff44b3d7d1dc1c/internal/ingress/metric/collectors/controller.go#L343-L346, there already seems to be plenty of logic for metric removal.
But apparently this does not yet cover the metrics about orphaned resources introduced via https://github.com/kubernetes/ingress-nginx/issues/4763.
@frittentheke there are no resources available to work on this. In case someone submits a PR, it may get reviewed depending on how informative the PR description and the details provided are.
That sounds fair. But would you at least consider reopening this bug / issue? If it's an open issue, it's much more attractive to pick up and tackle ;-)
I request that @horihel reopen it. Thank you for understanding.
I could request that, but probably cannot contribute anything substantial (except a few workarounds)
The way to reopen is to type /reopen. Thanks for understanding, so please decide yourself. Regards. The stale TLS cert info should also go. I am just curious: on the top right there is a range for time, so if someone selects a time from the past, will the data be nil?
/reopen
Yes, if a metric goes missing, Grafana/Prometheus will revert to "no data" in that specific time range.
Our current workaround is both dropping that metric altogether in the Prometheus scrape config and restarting ingress-nginx regularly.
@horihel: Reopened this issue.
What happened: See Screenshot:
We've observed Prometheus gradually use more and more memory over time. After some inspection, we found that nginx_ingress_controller_orphan_ingress constantly exports a really large number of label sets, even for namespaces that haven't existed for quite a while. This cluster might be a bit of a special case, as it constantly creates/destroys namespaces with 10-20 ingresses each to run tests.
It's easy to see that this adds up and this number does not go down (unless the nginx-pods are killed):
What you expected to happen:
If I understand correctly, label sets are usually kept on /metrics, but in this case (and maybe others) it might be worth considering no longer exporting the metric once the ingress has been deleted.
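To illustrate that expectation, a minimal sketch, assuming the orphan metric is backed by a *prometheus.GaugeVec with namespace/ingress labels and that the controller has some hook that fires when an Ingress is deleted; the function name and label names are assumptions, not the controller's actual API.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// onIngressDeleted is a hypothetical hook: when an Ingress object disappears
// from the cluster, drop every series of the orphan gauge that refers to it,
// so /metrics stops exporting the stale label set.
func onIngressDeleted(orphanGauge *prometheus.GaugeVec, namespace, name string) {
	// DeletePartialMatch (client_golang >= v1.12) removes all series whose
	// labels contain this subset, independent of any other labels (such as a
	// "type" label) the metric may carry.
	orphanGauge.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   name,
	})
}
```

Deleting on the deletion event would keep /metrics bounded by the number of ingresses that currently exist, rather than by every ingress the pod has ever seen.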
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): This is rke2-ingress-nginx as shipped with rke2 v1.25.11+rke2r1.
I'm not sure if this version is a fork or vendored by Rancher, but glancing at the code it looks like orphans aren't removed in current mainline 1.8.1 either. I haven't tested that yet though (sorry).
Kubernetes version (use kubectl version): v1.25.11+rke2r1
Environment: rke2 managed by Rancher on vSphere
Cloud provider or hardware configuration: vSphere
OS (e.g. from /etc/os-release): Ubuntu 22.02
Kernel (e.g. uname -a): Linux rke2-ingress-nginx-controller-f7c4b 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Install tools: rancher v2.7.4, rke2 all defaults, ServiceMonitors enabled via helmChartConfig.
Please mention how/where the cluster was created (kubeadm/kops/minikube/kind etc.).
Basic cluster related info:
kubectl version
kubectl get nodes -o wide
How was the ingress-nginx-controller installed:
helm ls -A | grep -i ingress
helm -n <ingresscontrollernamespace> get values <helmreleasename>
Current State of the controller:
kubectl describe ingressclasses
kubectl -n <ingresscontrollernamespace> get all -A -o wide
kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
Current state of ingress object, if applicable:
kubectl -n <appnamespace> get all,ing -o wide
kubectl -n <appnamespace> describe ing <ingressname>
Others:
kubectl describe ... of any custom configmap(s) created and in use
How to reproduce this issue:
1. Create a namespace with a few ingresses (doesn't matter if orphaned or not).
2. Delete the namespace.
3. Observe that metrics for the ingresses stay in /metrics, and the orphaned status stays in there as well.
Anything else we need to know: