tony-liuliu closed this issue 3 weeks ago
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Have you tried testing the network in your cluster first? For example, without ingress-nginx
/remove-kind bug
Have you tried testing the network in your cluster first? For example, without ingress-nginx
Yes, I am sure. Access to every service other than ingress-nginx is completely normal; there is no such network hang.
Test Results:
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n kubernetes-dashboard get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kubernetes-dashboard-api-949ddd7bb-6qzpp 1/1 Running 0 18h 10.244.32.42 dong-k8s-93 <none> <none>
kubernetes-dashboard-metrics-scraper-6c6c7b7cf4-5fk8r 1/1 Running 0 18h 10.244.32.38 dong-k8s-93 <none> <none>
kubernetes-dashboard-web-5476467fcc-vhcv7 1/1 Running 0 18h 10.244.32.36 dong-k8s-93 <none> <none>
[root@dong-k8s-90 ingress-nginx-controller]# time for i in `seq 1 1000`;do echo $i;curl -I http://10.244.32.42:9000/api/;done
......
1000
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Sat, 05 Aug 2023 03:11:21 GMT
Content-Length: 19
real 0m6.018s
user 0m2.126s
sys 0m3.329s
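For contrast, pointing the same loop at the controller pod IP reproduces the intermittent stall. A sketch (the IP here is hypothetical; use the address reported by kubectl -n ingress-nginx get pod -o wide, and -m 5 bounds each attempt so a hang shows up as a timeout instead of blocking the loop):
CONTROLLER_POD_IP=10.244.158.232    # hypothetical; substitute the controller pod IP from kubectl
time for i in `seq 1 1000`; do curl -sS -m 5 -o /dev/null -w "$i %{http_code}\n" http://$CONTROLLER_POD_IP/ || echo "$i timed out"; done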
@tony-liuliu There are no answers to the questions asked in the issue template, so everything you are saying here assumes that your cluster and environment are in a 100% acceptable state. It also assumes that your installation of the ingress-nginx controller is 100% correct. That does not work when a deep dive is required.
Please provide the details asked for in a new issue template.
Have the same issue with the latest helm chart. Everything else works besides ingress-nginx. It works sometimes; other times it holds the connection open and nothing happens.
I will respond with the issue template answers later in the day.
any logs?
After testing today, I found that the reason the nginx-controller network hangs intermittently may be related to this:
The KVM virtual machine node running the nginx-controller pod has 16 CPU cores. The default worker-processes setting is auto, which should normally create 16 worker processes, but only 13 were created here.
In other words, with worker-processes at its default of auto (16), the nginx-controller network hangs intermittently. After testing, I found that only 13 worker processes were actually created, and this mismatch may be the main cause of the problem.
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx exec -it ingress-nginx-controller-7d6797bbcb-pgdj7 sh
/etc/nginx $ head 10 /etc/nginx/nginx.conf
head: 10: No such file or directory
==> /etc/nginx/nginx.conf <==
# Configuration checksum: 15638244883250834871
# setup custom paths that do not require root access
pid /tmp/nginx/nginx.pid;
daemon off;
worker_processes 16;
/etc/nginx $ ps -ef
PID USER TIME COMMAND
1 www-data 0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --vali
7 www-data 0:01 /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/us
33 www-data 0:00 nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
38 www-data 0:00 nginx: worker process
39 www-data 0:00 nginx: worker process
40 www-data 0:00 nginx: worker process
41 www-data 0:00 nginx: worker process
42 www-data 0:00 nginx: worker process
43 www-data 0:00 nginx: worker process
44 www-data 0:00 nginx: worker process
45 www-data 0:00 nginx: worker process
46 www-data 0:00 nginx: worker process
47 www-data 0:00 nginx: worker process
48 www-data 0:00 nginx: worker process
49 www-data 0:00 nginx: worker process
50 www-data 0:00 nginx: worker process
64 www-data 0:00 nginx: cache manager process
517 www-data 0:00 sh
536 www-data 0:00 ps -ef
/etc/nginx $ ps -ef|grep 'worker process'|grep -v grep|wc -l
13
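The same comparison in compact form, runnable from outside the pod (a sketch; deploy/ingress-nginx-controller assumes the default deployment name used in this thread):
CONFIGURED=$(kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- awk '/^worker_processes/ {gsub(";",""); print $2}' /etc/nginx/nginx.conf)
ACTUAL=$(kubectl -n ingress-nginx exec deploy/ingress-nginx-controller -- sh -c "ps -ef | grep 'nginx: worker process' | grep -v grep | wc -l")
echo "configured=$CONFIGURED actual=$ACTUAL"    # these two numbers disagree when the problem occurs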
When I manually adjust worker-processes to 13 or fewer, network requests through the controller are normal:
[root@dong-k8s-90 ingress-nginx-controller]# vim ingress-nginx-controller-1.8.1.yaml
......
---
apiVersion: v1
data:
  allow-snippet-annotations: "true"
  worker-processes: "13"
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.8.1
  name: ingress-nginx-controller
  namespace: ingress-nginx
......
[root@dong-k8s-90 ingress-nginx-controller]# kubectl apply -f ingress-nginx-controller-1.8.1.yaml
namespace/ingress-nginx unchanged
serviceaccount/ingress-nginx unchanged
serviceaccount/ingress-nginx-admission unchanged
role.rbac.authorization.k8s.io/ingress-nginx unchanged
role.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
clusterrole.rbac.authorization.k8s.io/ingress-nginx unchanged
clusterrole.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
rolebinding.rbac.authorization.k8s.io/ingress-nginx unchanged
rolebinding.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx unchanged
clusterrolebinding.rbac.authorization.k8s.io/ingress-nginx-admission unchanged
configmap/ingress-nginx-controller configured
service/ingress-nginx-controller unchanged
service/ingress-nginx-controller-admission unchanged
deployment.apps/ingress-nginx-controller configured
job.batch/ingress-nginx-admission-create unchanged
job.batch/ingress-nginx-admission-patch unchanged
ingressclass.networking.k8s.io/nginx unchanged
validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission configured
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller
deployment.apps/ingress-nginx-controller restarted
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-admission-create-58w7p 0/1 Completed 0 28h 10.244.158.225 dong-k8s-95 <none> <none>
ingress-nginx-admission-patch-ctgjm 0/1 Completed 0 28h 10.244.158.226 dong-k8s-95 <none> <none>
ingress-nginx-controller-74597567dd-njqzp 1/1 Running 0 16s 10.244.158.232 dong-k8s-95 <none> <none>
[root@dong-k8s-90 ingress-nginx-controller]# kubectl -n ingress-nginx exec -it ingress-nginx-controller-74597567dd-njqzp sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/etc/nginx $ ps -ef
PID USER TIME COMMAND
1 www-data 0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --vali
7 www-data 0:01 /nginx-ingress-controller --election-id=ingress-nginx-leader --controller-class=k8s.io/ingress-nginx --ingress-class=nginx --configmap=ingress-nginx/ingress-nginx-controller --validating-webhook=:8443 --validating-webhook-certificate=/usr/local/certificates/cert --validating-webhook-key=/us
32 www-data 0:00 nginx: master process /usr/bin/nginx -c /etc/nginx/nginx.conf
37 www-data 0:00 nginx: worker process
38 www-data 0:00 nginx: worker process
39 www-data 0:00 nginx: worker process
40 www-data 0:00 nginx: worker process
41 www-data 0:00 nginx: worker process
42 www-data 0:00 nginx: worker process
43 www-data 0:00 nginx: worker process
44 www-data 0:00 nginx: worker process
45 www-data 0:00 nginx: worker process
46 www-data 0:00 nginx: worker process
47 www-data 0:00 nginx: worker process
48 www-data 0:00 nginx: worker process
49 www-data 0:00 nginx: worker process
50 www-data 0:00 nginx: cache manager process
53 www-data 0:00 nginx: cache loader process
468 www-data 0:00 sh
474 www-data 0:00 ps -ef
/etc/nginx $ ps -ef|grep 'worker process'|grep -v grep|wc -l
13
I kept adjusting the value of worker-processes and found that as long as it matches the number of worker processes actually created, the intermittent network hangs do not occur.
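The same fix can be applied without re-applying the full manifest; a sketch (13 is the worker count observed in this environment; use whatever count your pod actually creates):
kubectl -n ingress-nginx patch configmap ingress-nginx-controller --type merge -p '{"data":{"worker-processes":"13"}}'
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller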
any logs?
Not from the default config. I will look at adding debug options.
I exec into the pod and curl localhost. Sometimes it works; sometimes it hangs "forever". When it hangs, there are no logs from nginx.
ingress-nginx-controller-795cfcbd49-ljmrb:/etc/nginx$ curl localhost --haproxy-protocol
default backend - 404
ingress-nginx-controller-795cfcbd49-ljmrb:/etc/nginx$ curl localhost --haproxy-protocol
default backend - 404
ingress-nginx-controller-795cfcbd49-ljmrb:/etc/nginx$ curl localhost --haproxy-protocol
[ hangs forever ]
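One way to get more detail when it hangs is to raise the nginx error log level; error-log-level is a documented ingress-nginx ConfigMap key. A sketch, assuming the ConfigMap and namespace names that the helm values further below imply:
kubectl -n kube-ingress patch configmap ingress-nginx-controller --type merge -p '{"data":{"error-log-level":"debug"}}'
kubectl -n kube-ingress logs deploy/ingress-nginx-controller -f    # watch while reproducing the hang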
My setup is podman, kind, cilium, metallb, ingress-nginx. You can see the terraform here; happy to do a branch to untangle it for testing.
https://github.com/shutthegoatup/homelab/blob/main/terraform/02_network/main.tf
adegnan@millie:~$ uname -a
Linux millie 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-2 (2023-07-27) x86_64 GNU/Linux
adegnan@millie:~$ cat /etc/debian_version
12.1
adegnan@millie:~$ podman -v
podman version 4.3.1
adegnan@millie:~$ ~/go/bin/kind version
kind v0.20.0 go1.19.8 linux/amd64
adegnan@millie:~$ sudo podman network inspect kind
[
  {
    "name": "kind",
    "id": "aefc801cc799bec72fd2f334124184f4c5d831be7500587c0365101129c87abd",
    "driver": "bridge",
    "network_interface": "podman1",
    "created": "2023-08-05T21:10:42.102673742+01:00",
    "subnets": [
      {
        "subnet": "192.168.2.0/24",
        "gateway": "192.168.2.1",
        "lease_range": {
          "start_ip": "192.168.2.1",
          "end_ip": "192.168.2.127"
        }
      }
    ],
    "ipv6_enabled": false,
    "internal": false,
    "dns_enabled": true,
    "ipam_options": {
      "driver": "host-local"
    }
  }
]
k8s cluster
[adegnan@ub3r 02_network (chore/fix-ranges-disk-sizes)]$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"archive", BuildDate:"2023-07-20T07:37:53Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-15T00:36:28Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
[adegnan@ub3r 02_network (chore/fix-ranges-disk-sizes)]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
main-control-plane Ready control-plane 36h v1.27.3 192.168.2.4 <none> Debian GNU/Linux 11 (bullseye) 6.1.0-10-amd64 containerd://1.7.1
main-worker Ready <none> 36h v1.27.3 192.168.2.5 <none> Debian GNU/Linux 11 (bullseye) 6.1.0-10-amd64 containerd://1.7.1
main-worker2 Ready <none> 36h v1.27.3 192.168.2.6 <none> Debian GNU/Linux 11 (bullseye) 6.1.0-10-amd64 containerd://1.7.1
main-worker3 Ready <none> 36h v1.27.3 192.168.2.3 <none> Debian GNU/Linux 11 (bullseye) 6.1.0-10-amd64 containerd://1.7.1
adegnan@millie:~$ kubectl describe ingressclasses
Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.8.1
              helm.sh/chart=ingress-nginx-4.7.1
Annotations:  ingressclass.kubernetes.io/is-default-class: true
              meta.helm.sh/release-name: ingress-nginx
              meta.helm.sh/release-namespace: kube-ingress
Controller:   k8s.io/ingress-nginx
Events:       <none>
adegnan@millie:~$ helm -n kube-ingress get values ingress-nginx
USER-SUPPLIED VALUES:
controller:
  config:
    force-ssl-redirect: true
    use-proxy-protocol: true
  extraArgs:
    default-ssl-certificate: kube-ingress/wildcard-tls
  ingressClassResource:
    default: true
  kind: Deployment
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
      enabled: true
  replicas: 2
  service:
    externalTrafficPolicy: Local
    type: LoadBalancer
defaultBackend:
  enabled: true
I kept adjusting the value of worker-processes and found that as long as it matches the number of worker processes actually created, the intermittent network hangs do not occur.
Can confirm. My config had worker-processes at 16, but the container only created 8 workers. Fixing the setting makes the issue go away.
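For a helm install, the equivalent override goes through controller.config, which the chart renders into the controller ConfigMap. A sketch, assuming the ingress-nginx/ingress-nginx repo alias and the release/namespace names from the values output above; 8 is the worker count that matched this container:
helm -n kube-ingress upgrade ingress-nginx ingress-nginx/ingress-nginx --reuse-values --set-string controller.config.worker-processes=8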
If your issue can be solved by adjusting worker-processes, then you need to look at factors such as load, the network card, interrupts, and so on.
This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as they can. If you have any question, or want to request prioritization, please reach out in #ingress-nginx-dev on Kubernetes Slack.
I kept adjusting the value of worker-processes and found that as long as it matches the number of worker processes actually created, the intermittent network hangs do not occur.
The same here. The problem occurred after bumping nodes from 8 to 16 vCPUs. Setting worker-processes to 8 resolved the problem.
In other words, with worker-processes at its default of auto (16), the nginx-controller network hangs intermittently. After testing, I found that only 13 worker processes were actually created, and this mismatch may be the main cause of the problem.
I had 13 workers, exactly as mentioned above.
My setup: Proxmox 8.0.3, 10 VMs with 16 CPUs and 16 GB RAM each, Ubuntu 22.04.3 LTS, K8s v1.26.9+rke2r1 with the Cilium network plugin, ingress-controller installed by Helm chart v4.9.1 (nginx version: nginx/1.21.6).
We are still hitting this, not sure why, but strangely the intermittent failures NEVER happen when we set the replica count to 1. Unfortunately we cannot find any relevant explanation for this behaviour.
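For anyone wanting to test the single-replica observation above, a sketch of the helm override (controller.replicaCount is the chart value; the release and namespace names are assumptions):
helm -n kube-ingress upgrade ingress-nginx ingress-nginx/ingress-nginx --reuse-values --set controller.replicaCount=1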
Issue was solved so closing.
/close
@longwuyuan: Closing this issue.
@longwuyuan sorry, how come it was solved? Can you please point us to the PR fixing this? Thank you! 💯
Adjusting workers as mentioned here https://github.com/kubernetes/ingress-nginx/issues/10276#issuecomment-1667560577
If it is not so, then kindly re-open the issue after posting information that can be analyzed. Please use a kind cluster to reproduce the issue. Please use helm to install the controller, and provide the values file used to install it. You can also fork the project, create a branch, and clone the branch locally. Then, from the root of the local clone, you can run make dev-env to create a cluster automatically with the controller installed, as sketched below. Then you can do your tests locally and provide all the commands you executed and all the manifests you used, etc., so that a reader here can reproduce your test exactly. Thanks.
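The reproduction flow described above, as a sketch (YOUR_GITHUB_USER is a placeholder for your fork):
git clone https://github.com/YOUR_GITHUB_USER/ingress-nginx.git
cd ingress-nginx
make dev-env    # per the comment above: creates a kind cluster with the controller installed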
Problem phenomenon: after deploying the latest ingress-nginx-controller, requests to port 80 or 443 of the nginx-controller pod IP address frequently get stuck. Even entering the ingress-nginx-controller container and running curl 127.0.0.1 gets stuck. Please help me find out what the problem is.
All requests to services other than the ingress-nginx-controller run normally, including the ingress-nginx-controller's own health check port 10254.
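A bounded version of the in-pod check described above, as a sketch (-m 10 turns an indefinite hang into a measurable timeout; the label selector is the one the controller manifests apply):
POD=$(kubectl -n ingress-nginx get pod -l app.kubernetes.io/component=controller -o name | head -1)
kubectl -n ingress-nginx exec -it $POD -- curl -m 10 -I http://127.0.0.1/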
Environmental information:
kubernetes version: 1.27.4
OS: CentOS Linux release 7.9.2009 (Core)
Linux kernel: Linux dong-k8s-90 4.20.13-1.el7.elrepo.x86_64 #1 SMP Wed Feb 27 10:02:05 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
runtime: containerd://1.7.2
Install tools:
CNI: calico-3.26.1 using IPIP mode, Deployment manifest used https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
How was the ingress-nginx-controller installed: ingress-nginx-controller version: v1.8.1 Deployment manifest used https://github.com/kubernetes/ingress-nginx/blob/main/deploy/static/provider/baremetal/deploy.yaml
Current State of the controller:
The following is the packet capture information from when the problem occurs.
The client initiates a curl request; it stays stuck in this state and never returns.
PS: because the pod has been restarted, the IP address seen has changed and the captured information differs.
The request packets captured on the client.
The packet capture inside the ingress-nginx-controller container.
This causes the client to hang constantly, at a very high frequency. Please help me find out what is causing the problem.