Closed: nrobert13 closed this issue 2 years ago.
@nrobert13: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Hi @nrobert13,
If I use this doc https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/ on a kind cluster, will I be able to reproduce this problem?
/remove-kind bug
/kind support
@longwuyuan, thanks for the reply. You don't need to set up the prometheus/grafana stack for this. If you shell into the nginx controller pod, you can pull the metrics directly; see my snippet in the "What happened" section.
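(For reference, a minimal sketch of pulling the metrics from inside the controller pod, assuming the default metrics port 10254, a controller running in the ingress-nginx namespace, and curl or wget being available in the image; the pod name is a placeholder:)

# exec into the controller pod
kubectl -n ingress-nginx exec -it <controller-pod-name> -- sh
# inside the pod, scrape the Prometheus endpoint and filter the connection metrics
curl -s http://localhost:10254/metrics | grep nginx_ingress_controller_nginx_process_connections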
Hi @nrobert13,
I think it's caused by resource utilisation or locks/delays in your environment. I am unable to reproduce the problem.
@longwuyuan thanks for looking into this. I suspect the behaviour is related to the persistent connections (see my ingress resource snippet). The clients open keep-alive connections to nginx, and nginx keeps them alive towards the upstream (backend) with nginx.ingress.kubernetes.io/proxy-read-timeout: "14400". At the time the connection count bumps, we see a large number of RESETs in the upstream (backend) service, which makes me think that these resets are not counted by the nginx metrics, and the newly created connections are just added on top.
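(For context, the timeout mentioned above is a regular Ingress annotation. A sketch of setting it imperatively, with the namespace and ingress name as placeholders, could look like:)

# set the long proxy read timeout on the Ingress resource
kubectl -n <appnamespace> annotate ingress <ingressname> nginx.ingress.kubernetes.io/proxy-read-timeout="14400" --overwrite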
I agree. What you describe (keepalives) and multiple other use-cases will not have been instrumented into the metrics, I think. It looks like a deep dive will be required, and appropriate instrumentation for managing such custom configs will need to be developed.
The ingress-nginx project does not have enough resources to do this kind of development now. Would you be interested in submitting a PR on this? I am not a developer, so it is hard for me to deep dive into this.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
Kubernetes version (use kubectl version):
Environment:
Cloud provider or hardware configuration: GCP
OS (e.g. from /etc/os-release): COS
Kernel (e.g. uname -a): Linux ingress-nginx-machines-controller-7f54d7c564-zv5fp 5.4.144+ #1 SMP Sat Sep 25 09:56:01 PDT 2021 x86_64 Linux
How was the ingress-nginx-controller installed:
kubectl -n <appnamespace> get all,ing -o wide
kubectl -n <appnamespace> describe ing <ingressname>
What happened:
The prometheus metrics are reporting more connections than the actual number reported by netstat. In this case it's a ~45% overflow: 7700 reported by netstat, while nginx reports 13400.
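(A rough sketch of how the two numbers can be compared from inside the controller pod; the netstat filter is an assumption about how the count was taken, and the metric name and port 10254 reflect the ingress-nginx defaults:)

# connections as seen by the kernel inside the controller pod
netstat -tn | grep -c ESTABLISHED
# connections as reported by the controller's Prometheus endpoint
curl -s http://localhost:10254/metrics | grep 'nginx_ingress_controller_nginx_process_connections{state="active"}'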
What you expected to happen:
Expected the active connections metric and netstat to report roughly the same numbers.
How to reproduce it:
Anything else we need to know:
We use persistent connections (~18k) for SSE, and under some circumstances the number of connections reported by nginx is bumped by roughly the same amount of connections (~18k), although this increase cannot be observed in the netstat output. A restart of the deployment solves the problem, and the metrics are again aligned with what netstat reports.
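(For reference, the restart workaround described above amounts to something like the following, assuming a default ingress-nginx install; the namespace and deployment name may differ in your setup:)

# roll the controller pods; after the restart the connection gauge matches netstat again
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller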