kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

nginx_ingress_controller_orphan_ingress accumulates very many series over time #10242

Open horihel opened 1 year ago

horihel commented 1 year ago

What happened:

[screenshot: Prometheus memory usage growing over time]

We've observed Prometheus gradually using more and more memory over time. After some inspection, we found that nginx_ingress_controller_orphan_ingress constantly exports a very large number of label sets, even for namespaces that have not existed for quite a while.

This cluster might be a bit of a special case, as it constantly creates and destroys namespaces with 10-20 ingresses each in order to run tests.

It's easy to see how this adds up, and the series count does not go down (unless the nginx pods are killed):

[screenshot: number of nginx_ingress_controller_orphan_ingress series only ever increasing]

What you expected to happen:

If I understand correctly, label sets are usually kept on /metrics, but in this case (and maybe others) it might be worth considering no longer exporting the series once the ingress has been deleted.
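
For illustration, here is a minimal, self-contained sketch (not the controller's actual code) of what that could look like with client_golang: `DeletePartialMatch` removes every series of a metric vector whose labels contain a given subset, so a hypothetical cleanup hook could call it when an ingress or namespace is deleted. The metric and label names below mirror the ones discussed in this issue; everything else is an assumption.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// orphanIngress stands in for the controller's orphan-ingress gauge; the label
// names mirror nginx_ingress_controller_orphan_ingress, but this is an
// illustrative sketch, not the controller's real collector.
var orphanIngress = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nginx_ingress_controller_orphan_ingress",
		Help: "Sketch: indicates an ingress referencing missing services/endpoints.",
	},
	[]string{"namespace", "ingress", "type"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(orphanIngress)

	// Series created while the test namespace still exists.
	orphanIngress.WithLabelValues("test-ns-123", "app-a", "no-service").Set(1)
	orphanIngress.WithLabelValues("test-ns-123", "app-b", "no-endpoint").Set(1)

	// Hypothetical cleanup hook: on namespace (or ingress) deletion, drop every
	// series carrying that namespace label, whatever the other labels are.
	removed := orphanIngress.DeletePartialMatch(prometheus.Labels{"namespace": "test-ns-123"})
	fmt.Printf("removed %d stale series\n", removed)
}
```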

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): This is rke2-ingress-nginx as shipped with rke2 v1.25.11+rke2r1.

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       nginx-1.6.4-hardened4
  Build:         git-90e1717ce
  Repository:    https://github.com/rancher/ingress-nginx.git
  nginx version: nginx/1.21.4

-------------------------------------------------------------------------------

I'm not sure whether this version is a fork or just vendored by Rancher, but glancing at the code, it looks like orphan series aren't removed in the current mainline 1.8.1 either. I haven't tested that yet, though (sorry).

Kubernetes version (use kubectl version): v1.25.11+rke2r1

Environment: rke2 managed by rancher on vSphere

How to reproduce this issue:

1. Create a namespace with a few ingresses (it doesn't matter whether they are orphaned or not).
2. Delete the namespace.
3. Observe that the metrics for those ingresses stay in /metrics, and the orphan status stays there as well.

Anything else we need to know:

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

longwuyuan commented 1 year ago

Your observation is correct. This happens with Secrets of type TLS as well. It has to do with the design: an object may no longer exist, but the time series still has data from the time the object did exist.

There are not many resources to work on this issue. If you want to submit a PR, it will be welcome. For the similar problem of deleted TLS secrets still showing up in graphs, I think there was a PR to change the PromQL query itself.

/remove-kind bug
/kind feature

horihel commented 1 year ago

I'm afraid Go is out of my league (for now). We've dropped the metric in the Prometheus config and RAM usage has stabilized.

Revolution1 commented 1 year ago

I also encountered the same issue: the large number of series for historical ingresses crashed my Prometheus, which needed more than 30Gi of memory to process them.

I'll try and see if I can put together a PR for this.

Revolution1 commented 1 year ago

related: #8230 @alex123012

Revolution1 commented 1 year ago

Well, I found that orphan_ingress is not the only metric that accumulates over time; nginx_ingress_controller_check_success does as well,

and I expect there are other metrics that behave like this.

We need a full cleanup across the whole registry, like the socket collector does: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/metric/collectors/socket.go#L478
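
A rough sketch of what such a registry-wide sweep could look like, loosely modelled on the socket collector's cleanup but not taken from it; the `cleanStaleSeries` helper, the `vecs` map, and the `liveIngresses` set are assumptions made for this example.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// cleanStaleSeries gathers all metric families from the registry, inspects the
// namespace/ingress labels of every series, and deletes series whose ingress is
// no longer live. Only vectors listed in vecs (owned by this collector) are swept.
func cleanStaleSeries(reg *prometheus.Registry, vecs map[string]*prometheus.GaugeVec, liveIngresses map[string]bool) error {
	families, err := reg.Gather()
	if err != nil {
		return err
	}
	for _, mf := range families {
		vec, tracked := vecs[mf.GetName()]
		if !tracked {
			continue // skip metrics this collector does not own
		}
		for _, m := range mf.GetMetric() {
			labels := prometheus.Labels{}
			for _, lp := range m.GetLabel() {
				labels[lp.GetName()] = lp.GetValue()
			}
			if !liveIngresses[labels["namespace"]+"/"+labels["ingress"]] {
				vec.Delete(labels) // drop the whole stale series
			}
		}
	}
	return nil
}

func main() {
	orphan := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "nginx_ingress_controller_orphan_ingress", Help: "sketch"},
		[]string{"namespace", "ingress", "type"},
	)
	reg := prometheus.NewRegistry()
	reg.MustRegister(orphan)

	orphan.WithLabelValues("deleted-ns", "old-ingress", "no-service").Set(1)
	orphan.WithLabelValues("prod", "api", "no-endpoint").Set(1)

	// Only prod/api still exists in the cluster; the deleted-ns series is removed.
	live := map[string]bool{"prod/api": true}
	if err := cleanStaleSeries(reg, map[string]*prometheus.GaugeVec{
		"nginx_ingress_controller_orphan_ingress": orphan,
	}, live); err != nil {
		fmt.Println("sweep failed:", err)
	}
}
```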

github-actions[bot] commented 1 year ago

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or request to prioritize this, please reach out to #ingress-nginx-dev on Kubernetes Slack.

longwuyuan commented 1 month ago

This is true. Another highlighted metric is about TLS certificates and their expiry date. Secrets get deleted but their metric does not get deleted.

However, the project does not have the resources to work on this at this scale, and hence I will close this for now. If a contributor/developer takes this up in the future, this issue can be re-opened. We want to avoid open issues that do not track any action item. All resources are engaged in security & Gateway API priorities.

/Close

k8s-ci-robot commented 1 month ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-2349178624).
longwuyuan commented 1 month ago

/kind bug

frittentheke commented 1 week ago

I observe the same issue as @horihel with the latest controller release.

@longwuyuan could you kindly reopen this issue, or point me to any potential follow-up issue / fix (as per your comment https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-1650768095) that was worked on? I suppose for TLS / certs this is PR https://github.com/kubernetes/ingress-nginx/pull/9706?

Looking at https://github.com/kubernetes/ingress-nginx/blob/a8c62e22b72c68e4a829cc7954ff44b3d7d1dc1c/internal/ingress/metric/collectors/controller.go#L343-L346, there already seems to be plenty of logic for metric removal.

But apparently this does not yet cover the metrics about orphaned resources introduced via https://github.com/kubernetes/ingress-nginx/issues/4763.
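
A hypothetical sketch of how that removal path could be extended to cover the orphan metric as well; the `Controller` struct and `RemoveIngressMetrics` method below are stand-ins for this example, not the names used in controller.go.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Controller is a stand-in for the controller collector; the field and method
// names here are hypothetical, not the ones used in controller.go.
type Controller struct {
	OrphanIngress *prometheus.GaugeVec
}

// RemoveIngressMetrics sketches the extension: when an ingress disappears from
// the cluster, delete every orphan_ingress series belonging to it, regardless
// of the "type" label value.
func (c *Controller) RemoveIngressMetrics(namespace, ingress string) {
	c.OrphanIngress.DeletePartialMatch(prometheus.Labels{
		"namespace": namespace,
		"ingress":   ingress,
	})
}

func main() {
	c := &Controller{OrphanIngress: prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "nginx_ingress_controller_orphan_ingress", Help: "sketch"},
		[]string{"namespace", "ingress", "type"},
	)}
	c.OrphanIngress.WithLabelValues("test-ns", "app", "no-service").Set(1)
	c.RemoveIngressMetrics("test-ns", "app") // the series is gone from the next scrape
}
```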

longwuyuan commented 1 week ago

@frittentheke there are no resources available to work on this. In case someone submits a PR, it may get reviewed depending on how informative the PR description and the details provided are.

frittentheke commented 1 week ago

> @frittentheke there are no resources available to work on this. In case someone submits a PR, it may get reviewed depending on how informative the PR description and the details provided are.

That sounds fair. But would you at least consider reopening this bug / issue? If it's an open issue, it's much more attractive to pick up and tackle ;-)

longwuyuan commented 1 week ago

I request that @horihel reopen it. Thank you for understanding.

horihel commented 1 week ago

I could request that, but I probably cannot contribute anything substantial (except a few workarounds).

longwuyuan commented 1 week ago

The way to reopen is to type /reopen, so thanks for understanding and please decide for yourself. Regards. The stale TLS cert info should also go. I am just curious: at the top right there is a time-range selector, so if someone selects a time range from the past, will the data be nil?

horihel commented 1 week ago

/reopen

Yes, if a metric goes missing, Grafana/Prometheus will revert to "no data" for that specific time range.

Our current workaround is to drop that metric altogether in the Prometheus scrape config and to restart ingress-nginx regularly.

k8s-ci-robot commented 1 week ago

@horihel: Reopened this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/10242#issuecomment-2450093675).