kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Healthz not accurately reporting container health #9067

Open hassenius opened 1 year ago

hassenius commented 1 year ago

What happened: We currently have intermittent issues where ingress controllers stop accepting new connections (on ports 80 and 443), so the controllers are by definition not in a healthy state, but the healthz endpoint still reports ok.

If I exec into a running nginx controller pod, this is very clear:

bash-5.1$ curl -m 10 http://localhost
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received
bash-5.1$ curl http://localhost:10254/healthz && echo
ok

What you expected to happen: Health checks to fail when pod is not healthy
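
As far as I can tell, the stock manifests point both the liveness and the readiness probe at /healthz on port 10254, which would explain why kubelet never restarts a pod in this state. A quick way to confirm what the probes actually check (the deployment and namespace names below assume the standard manifests):

# Show the probes configured on the controller container; in the stock manifests
# both target the controller process on :10254, not the nginx ports 80/443.
kubectl -n ingress-nginx get deployment ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'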

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

bash-5.1$ ./nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.1.2
  Build:         bab0fbab0c1a7c3641bd379f27857113d574d904
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.7", GitCommit:"42c05a547468804b2053ecf60a3bd15560362fc2", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:41Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1

Environment:

NAME                                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
service/ingress-nginx-controller              NodePort    10.9.252.114   <none>        80:32080/TCP,443:32443/TCP   9d    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-admission    ClusterIP   10.9.244.170   <none>        443/TCP                      9d    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-metrics      ClusterIP   10.9.253.239   <none>        10254/TCP                    9d    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                                                                                                        SELECTOR
deployment.apps/ingress-nginx-controller    3/3     3            3           9d    controller   docker-local./ingress-nginx/controller:v1.1.2@sha256:6a18680809f9bdf7bba4092cede2b5f3ee3566fb65c1bf4cfce328c5ee94bac7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                                   DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES                                                                                                                        SELECTOR
replicaset.apps/ingress-nginx-controller-5c5cfcb869    3         3         3       9d    controller   docker-local./ingress-nginx/controller:v1.1.2@sha256:6a18680809f9bdf7bba4092cede2b5f3ee3566fb65c1bf4cfce328c5ee94bac7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5c5cfcb869


**How to reproduce this issue**:
We are currently _not_ able to reproduce the issue at will, but it _does_ happen at least once within a few days of a new cluster being created. Restarting the affected pods (generally 2 out of 3) seems to alleviate the problem for at least several days.
When the described state occurs, there is a high number of connections (between 6,000 and 17,000) stuck in CLOSE_WAIT:

netstat -a -n | grep CLOSE_WAIT | wc -l
11735
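
For what it's worth, a quick way to break those sockets down by TCP state when a pod is caught like this (run inside the affected controller pod, plain netstat/awk):

# Count connections per TCP state; CLOSE_WAIT dominating usually means the
# application side never close()d sockets the peer has already shut down.
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn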



**Anything else we need to know**:
k8s-ci-robot commented 1 year ago

@hassenius: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

hi @hassenius ,

Thanks for reporting this. Since you cannot reproduce it at will, you will have to gather info on a range of host-level resources (file handles, inodes, I/O, hardware status via dmesg, etc.) when it occurs, and also post kubectl get events -A taken at the time this happens.
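
Something along these lines, captured while the problem is happening, would be a starting point (only a rough sketch, adjust names and paths to your environment):

# Cluster-level view at the time of the incident
kubectl get events -A --sort-by=.lastTimestamp > events.txt
kubectl top nodes > top-nodes.txt              # needs metrics-server

# On the affected node itself
dmesg -T | tail -n 300      > dmesg.txt        # hardware / kernel messages
cat /proc/sys/fs/file-nr    > file-handles.txt # allocated vs max file handles
df -i                       > inodes.txt
iostat -x 1 5               > io.txt           # needs the sysstat package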

Another observation: while several people run the controller's Service as type NodePort, the official kubernetes.io docs do not recommend NodePort for production use. At this stage there is no information from your cluster to base this on, but the relevance is that you could be exhausting host or other resources (possibly made worse by using NodePort).

hassenius commented 1 year ago

Thanks for the attention @longwuyuan .

I think there are two elements to this:

1) What is the root cause of this state occurring? (Since we have not identified a good way of reproducing it at will, the forensic work is quite difficult, and any suggestions on what to capture when the situation does arise are very welcome. Maybe I'm better off opening a separate issue for this?)

2) Why does the liveness probe not pick up the problem and let kubelet restart the pod (which does resolve the issue when it occurs)? For this purpose we have managed to isolate two ingress pods in a cluster that are stuck in this state (after identifying them, we removed the owner reference and relabelled them so that traffic is not sent to the broken pods, roughly as sketched below).
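
Roughly what we did to quarantine them (the pod name below is a placeholder):

POD=ingress-nginx-controller-xxxxx   # placeholder: one of the broken pods

# Detach the pod from its ReplicaSet so it sticks around for debugging...
kubectl -n ingress-nginx patch pod "$POD" --type=json \
  -p='[{"op": "remove", "path": "/metadata/ownerReferences"}]'

# ...and relabel it so the Service and ReplicaSet selectors no longer match;
# traffic stops hitting it and the ReplicaSet brings up a healthy replacement.
kubectl -n ingress-nginx label pod "$POD" \
  app.kubernetes.io/name=ingress-nginx-quarantine --overwrite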

longwuyuan commented 1 year ago

I think it's not easy to get specific. But the general direction is things like this:

The most significant factor here is that nothing is known about your environment, so nothing can be generalized. Everything from your hardware to your virtualization to the Linux kernel, right down to any Kubernetes or controller-related config and the traffic itself, can be at play. I already gave you one example: if you have a NodePort Service for the controller, do you even know how many connections are related to that port on the host TCP/IP stack?
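
For example, something like this on a node would give a rough idea of how many tracked connections involve the controller's NodePorts (port numbers taken from your Service output above; requires conntrack-tools or the conntrack procfs interface on the host):

# Count conntrack entries hitting the 80/443 NodePorts on this node
conntrack -L 2>/dev/null | grep -cE 'dport=(32080|32443)'
# or, without conntrack-tools, if the procfs interface is enabled:
grep -cE 'dport=(32080|32443)' /proc/net/nf_conntrack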

hassenius commented 1 year ago

Thanks,

First, to clarify the comment about the NodePort recommendation: do you mean not to use hostPort or not to use NodePort? We use NodePort, and I'm not sure we have many good alternatives, as the connection between the external load balancer and the ingress is orchestrated by an external tool before the cluster is created. I thought this was a pretty common configuration, as mentioned in the ingress doc:

> An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type [Service.Type=NodePort](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) or [Service.Type=LoadBalancer](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer).

For the troubleshooting, here are further details of what we see in the troubled ingress pods:

1. There is a very high number of connections stuck in CLOSE_WAIT:

    netstat -n | grep CLOSE_WAIT | wc -l
    11820

    netstat -n | grep -v CLOSE_WAIT | wc -l
    81

(This seems to be the biggest symptom. Why are all these connections stuck in CLOSE_WAIT forever?)

2. nginx is not able to accept connections on port 80:

bash-5.1$ curl -m 5 -vvv http://localhost

but the health check seems ok (this is the same whether from within the pod or from any other cluster node or pod):

bash-5.1$ curl -m 5 -vvv http://localhost:10254/healthz
*   Trying 127.0.0.1:10254...
* Connected to localhost (127.0.0.1) port 10254 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:10254
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Mon, 19 Sep 2022 18:21:06 GMT
< Content-Length: 2
<
* Connection #0 to host localhost left intact
* 

It doesn't seem to be constrained in any particular way:

 sysctl -a | grep conntrack_max
net.netfilter.nf_conntrack_max = 3145728

cat /proc/sys/fs/file-max
59185830

ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2314720
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 4194304
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
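
(For anyone else hitting this, these are the kinds of counters worth comparing against the limits above; some may need to be read on the node rather than inside the pod:)

cat /proc/sys/net/netfilter/nf_conntrack_count   # compare with nf_conntrack_max above
cat /proc/sys/fs/file-nr                         # allocated vs max file handles
ls /proc/$(pgrep -f 'nginx: master')/fd | wc -l  # open fds held by the nginx master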

It seems to happen on different hosts, and the hosts it happens on do not have any other problem pods that we can see, apart from the nginx pods.

longwuyuan commented 1 year ago

My suggestion is:

/remove-kind bug
/kind support

Observation based on your posts:

strongjz commented 1 year ago

Per @rikatz: "https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/template/nginx.tmpl#L681 we need to check this Lua code and this part of the code to see if it is locking connections"

hassenius commented 1 year ago

Sorry to be slow circling back to this. The issue still periodically bites us, but this particular report is really about the health check not reflecting the actual health of the pod as one would expect, rather than about the root cause of why it happens.

Here is an example

bash-5.1$ curl http://localhost:10254/healthz
okbash-5.1$ curl http://localhost:10254/healthz?verbose
[+]ping ok
[+]nginx-ingress-controller ok
healthz check passed
bash-5.1$ curl -m 1 http://localhost/healthz
curl: (28) Operation timed out after 1000 milliseconds with 0 bytes received
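
As a stop-gap on our side (not an upstream recommendation, just a sketch of what we are considering), one could override the controller's liveness probe with an exec check that also exercises the data-plane port, so kubelet restarts the pod when nginx stops answering on 80 while /healthz still returns ok:

# Illustrative only: replaces the default livenessProbe on the standard deployment.
# curl exits non-zero on a timeout (exit 28), which marks the probe as failed,
# while a 404 from the default backend still counts as "nginx is answering".
kubectl -n ingress-nginx patch deployment ingress-nginx-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {
     "exec": {"command": ["/bin/sh", "-c",
       "curl -s -o /dev/null -m 3 http://127.0.0.1:80/ && curl -s -o /dev/null -m 3 http://127.0.0.1:10254/healthz"]},
     "initialDelaySeconds": 10, "periodSeconds": 10, "timeoutSeconds": 8, "failureThreshold": 3}}
]'

We would obviously much rather understand why the CLOSE_WAIT pile-up happens in the first place, but something like this would at least let kubelet recycle the broken pods automatically.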