Healthz not accurately reporting container health

hassenius commented 1 year ago

What happened: We currently have intermittent issues where ingress controllers stop accepting new connections (on 80 and 443 ports), so the controllers are by definition not in a healthy state, but the healtz enpoint works just fine.

if I exec into a running nginx controller, you can see this very clearly

bash-5.1$ curl -m 10 http://localhost
curl: (28) Operation timed out after 10001 milliseconds with 0 bytes received
bash-5.1$ curl http://localhost:10254/healthz && echo
ok

What you expected to happen: Health checks to fail when pod is not healthy

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

bash-5.1$ ./nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.1.2
  Build:         bab0fbab0c1a7c3641bd379f27857113d574d904
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.7", GitCommit:"42c05a547468804b2053ecf60a3bd15560362fc2", GitTreeState:"clean", BuildDate:"2022-05-24T12:24:41Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.25) and server (1.23) exceeds the supported minor version skew of +/-1

Environment:

Cloud provider or hardware configuration: bare metal, on-prem
OS (e.g. from /etc/os-release): RHEL 8.6
Kernel (e.g. uname -a): 4.18.0-372.9.1.el8.x86_64
Install tools: kubespray

Current state of ingress object, if applicable:


NAME                                            READY   STATUS    RESTARTS   AGE   IP           NODE                                     NOMINATED NODE   READINESS GATES
pod/ingress-nginx-controller-5c5cfcb869-2fvp5   1/1     Running   0          9d    10.9.51.5    k8s-c001-worker02.<snip>   <none>           <none>
pod/ingress-nginx-controller-5c5cfcb869-dwjj2   1/1     Running   0          9d    10.9.61.15   k8s-c001-worker09.<snip>   <none>           <none>
pod/ingress-nginx-controller-5c5cfcb869-fpssw   1/1     Running   0          9d    10.9.5.8     k8s-c001-worker04.<snip>   <none>           <none>

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR service/ingress-nginx-controller NodePort 10.9.252.114 80:32080/TCP,443:32443/TCP 9d app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx service/ingress-nginx-controller-admission ClusterIP 10.9.244.170 443/TCP 9d app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx service/ingress-nginx-controller-metrics ClusterIP 10.9.253.239 10254/TCP 9d app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR deployment.apps/ingress-nginx-controller 3/3 3 3 9d controller docker-local./ingress-nginx/controller:v1.1.2@sha256:6a18680809f9bdf7bba4092cede2b5f3ee3566fb65c1bf4cfce328c5ee94bac7 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR replicaset.apps/ingress-nginx-controller-5c5cfcb869 3 3 3 9d controller docker-local./ingress-nginx/controller:v1.1.2@sha256:6a18680809f9bdf7bba4092cede2b5f3ee3566fb65c1bf4cfce328c5ee94bac7 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5c5cfcb869


**How to reproduce this issue**:
We are currently _not_ able to reproduce the issue at will, but it _does_ happen at least one time a few days after a new cluster is created. Restarting the affected pods (generally 2 out of 3 pods) seem to alleviate the problem for at least a high number of days.
When the described state is happening, there are a high number of connections (between 6-17 000) in CLOSE_WAIT

netstat -a -n | grep CLOSE_WAIT | wc -l 11735



**Anything else we need to know**:

k8s-ci-robot commented 1 year ago

@hassenius: This issue is currently awaiting triage.

If Ingress contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

longwuyuan commented 1 year ago

hi @hassenius ,

Thanks for reporting this. I think not being able to reproduce at will means you will have to gather info on a bunch of host related resources ranging from filehandles, inodes, i/o, hardware status with dmesg etc etc. And also post kubectl get events -A taken at the time this occurs.

Another observation is that while several people run the service of the controller as type nodePort, the official kubernetes.io docs itself does not recommend nodePort for production use. At this stage there is no info from your cluster to base this statement on, but the relevance is host/other resources you could be exhausting (and maybe contributed by using nodePort)

hassenius commented 1 year ago

Thanks for the attention @longwuyuan .

I think there's elements in this

1) What is the root cause of this state occurring? (Since we have not identified any good way of reproducing this at will, it's quite difficult to do the forensic work, and any suggestions on what to capture when the situation does arise is very welcome. Maybe I'm best opening a separate issue for this?)

2) Why does the liveness probe not pick up the problem, letting kubelet restart the pod (which does resolve the issue when it occurs) For the purpose of this we have managed to isolate two ingress pods in a cluster that is stuck in this state (after being identified, we removed the owner reference, and relabelled them so that traffic is not sent to the broken pods).

longwuyuan commented 1 year ago

I think its not easy to get specific. But general direction is things like ;

is the webserver running
Has the webserver starved for filehandles to create new sockets
Does the webserver process register your curl for localhost and log it
Same analysis for your curl for healthz. Is your response from cache etc etc

Most significant factor here is nothing is known about your environment so can not generalize on anything. Everything from your hardware to your Virtualization to the linux kernel right down to any K8S or controller related config to the traffic can be at play. I already gave you one example. If you have a nodePort service for the controller, do you even know how many connections are related to that Port on the host TCP/IP stack etc etc

hassenius commented 1 year ago

Thanks,

First to clarify the comment about the NodePort recommendation. Do you mean not to use hostport or not to use nodeport? We use nodeport, and I'm not sure if we have many good alternatives, as the connection between the external loadbalancer and the ingress is orchestrated by an external tool before the cluster is created. I thought this was a pretty common configuration as mentioned in the ingress doc

An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type [Service.Type=NodePort](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport) or [Service.Type=LoadBalancer](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer).

For the troubleshooting and further details on what we do see in the troubled pods ingress pods:

There are a very high number of connections stuck in CLOSE_WAIT
```
netstat -n | grep CLOSE_WAIT | wc -l
11820
```

netstat -n | grep -v CLOSE_WAIT | wc -l 81

(This seems to be the biggest symptom. Why all these connections stuck in CLOSE_WAIT forever?)

2. nginx is not able to accept connections to port 80

bash-5.1$ curl -m 5 -vvv http://localhost

Trying 127.0.0.1:80...
Trying ::1:80...
Connected to localhost (::1) port 80 (#0)

GET / HTTP/1.1 Host: localhost User-Agent: curl/7.79.1 Accept: /
Operation timed out after 5000 milliseconds with 0 bytes received
Closing connection 0 curl: (28) Operation timed out after 5000 milliseconds with 0 bytes received
```
Nothing is registered and logged when the pod is in the problem state
```

but health check seems ok (this is the same from within the pod, or from any other cluster node or pod)

bash-5.1$ curl -m 5 -vvv http://localhost:10254/healthz
*   Trying 127.0.0.1:10254...
* Connected to localhost (127.0.0.1) port 10254 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:10254
> User-Agent: curl/7.79.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Mon, 19 Sep 2022 18:21:06 GMT
< Content-Length: 2
<
* Connection #0 to host localhost left intact
*

It doesn't seem to be constrained in any particular way

 sysctl -a | grep conntrack_max
net.netfilter.nf_conntrack_max = 3145728

cat /proc/sys/fs/file-max
59185830

ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2314720
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 4194304
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

It seems to happen on different hosts, and the hosts it happens on does not have any kind of other problem pods that we can see, outside the nginx pods.

longwuyuan commented 1 year ago

My suggestion is ;

Upgrade to app version 1.3.1 of the controller
Install metallb.org
Use default service type of LoadBalancer for the controller if helm is used or change to service type LoadBalancer if static manifest is used
Run only 1 replica and not 3 replicas
Don't change anything in the default configuration of the controller
Install prometheus + grafana or use any other monitoring tool and configure dashboards/metrics visualization for different states of connections like ESTABLISHED, TIME_WAIT, CLOSE_WAIT etc etc
Ensure visualization for file-handles, tcp metrics etc
After you have made above changes, please post the output of following commands here
- Copy/paste the exact command and the details of any flags etc as it is , that was used to install the controller, along with any values file if helm was used
- kubectl -n ingress-nginx get all
- kubectl -n ingress-nginx describe po ingress-nginx-controller-..........
- kubectl -n ingress-nginx describe svc ingress-nginx-controller
Show screenshots of day1 and day3 metrics for the CLOSE_WAIT or INODES or any other resource whose usage has spiked and remains high

/remove-kind bug /kind support

Observation based on your posts ;

We use nodeport, and I'm not sure if we have many good alternatives, as the connection between the external loadbalancer and the ingress is orchestrated by an external tool before the cluster is created. What does this mean, I don't understand. Obviously if you don't follow the kubernetes.io docs on setting up a service of type LoadBalancer, then you are on your own. In this case, there is a simple reality that your users https request needs to be sent to hit a ipaddress:30001 type of destination. But it looks like your clients send request to some unknown hostname or ipaddress and then that traffic is bounced/forwarded/redirected/re-routed/other to the ipaddress:30001 destination. This type of config has lots of implications. Those implications lie outside the scope of discussion on the ingress-nginx-controller . For example too many connections in CLOSE_WAIT state is not a topic of discussion under the scope of this controller project. It can be discussed only after a step-by-step procedure to describe a problem in the controller code is provided. Its too vast and too complicated

strongjz commented 1 year ago

Per @rikatz "https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/template/nginx.tmpl#L681 we need to check this lua code and this part of the code if it is locking connections"

hassenius commented 1 year ago

Sorry to be slow circling back to this. The issue still periodically bites us, but really, this particular issue report is really more about the health check not picking up the accurate health of the pod as one would expect, rather than the root cause of why it happens.

Here is an example

bash-5.1$ curl http://localhost:10254/healthz
okbash-5.1$ curl http://localhost:10254/healthz?verbose
[+]ping ok
[+]nginx-ingress-controller ok
healthz check passed
bash-5.1$ curl -m 1 http://localhost/healthz
curl: (28) Operation timed out after 1000 milliseconds with 0 bytes received

kubernetes / ingress-nginx

Healthz not accurately reporting container health #9067