granescb opened this issue 2 weeks ago
Hi @granescb thanks for reporting!
Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this :slightly_smiling_face:
Cheers!
Hi @granescb,
Can you give more details about the node restarts? Are the nodes on a scheduled restart?
Can you try setting readOnlyRootFilesystem to false and let us know if this changes the behaviour?
In the meantime, we will do our best to reproduce the issue and get back as soon as we can.
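If it helps, something along these lines could flip it quickly on a manifest-managed install (the namespace, deployment name, and container index below are placeholders, and the patch assumes the field is already set on the container; if you installed via Helm, changing the corresponding chart value is the cleaner route):

```shell
# Sketch only: sets readOnlyRootFilesystem=false on the first container of the controller Deployment.
# Namespace and deployment name are placeholders; the "replace" op assumes the field already exists.
kubectl -n nginx-ingress patch deployment nginx-inc-ingress-controller --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/securityContext/readOnlyRootFilesystem","value":false}]'
```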
Hello @AlexFenlon, the node restart was related to a cluster component update, so k8s drained all the old nodes and migrated all Pods to new ones. We performed this operation on 4 k8s clusters but hit the ingress problem only on the biggest one: 2 of the 3 controller Pods went into CrashLoopBackOff. The biggest cluster has around 60 nodes and about 217 Ingress resources, which may be related to the problem.
Yes, I can try readOnlyRootFilesystem=false, but only in the staging cluster. The main problem is that I don't know how to reproduce the issue, so I can't verify whether readOnlyRootFilesystem=false actually fixes it. I will try to reproduce the problem with the current settings and then again with readOnlyRootFilesystem=false. UPD: we set readOnlyRootFilesystem = true during the update from 3.0.2 to 3.6.2.
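For context, enabling it via the Helm chart looks roughly like this (assuming a Helm-based install; the exact value name may differ between chart versions, so double-check your chart's values.yaml):

```shell
# Rough sketch of enabling the flag during a 3.0.2 -> 3.6.2 upgrade (Helm install assumed).
# Release name, chart reference, and the value key should be checked against your own setup.
helm upgrade nginx-inc-ingress-controller oci://ghcr.io/nginxinc/charts/nginx-ingress \
  -n nginx-ingress --reuse-values \
  --set controller.readOnlyRootFilesystem=true
```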
I reproduced the same behavior by sending signal 1 (SIGHUP) to the nginx master process from the k8s worker node.
2024/11/06 09:35:56 [notice] 14#14: signal 1 (SIGHUP) received from 24, reconfiguring
2024/11/06 09:35:56 [notice] 14#14: reconfiguring
2024/11/06 09:35:56 [warn] 14#14: duplicate MIME type "text/html" in /etc/nginx/nginx.conf:28
2024/11/06 09:35:56 [notice] 14#14: using the "epoll" event method
2024/11/06 09:35:56 [notice] 14#14: start worker processes
2024/11/06 09:35:56 [notice] 14#14: start worker process 25
2024/11/06 09:35:56 [notice] 14#14: start worker process 26
2024/11/06 09:35:56 [notice] 14#14: start worker process 27
2024/11/06 09:35:56 [notice] 14#14: start worker process 28
2024/11/06 09:35:56 [notice] 16#16: gracefully shutting down
2024/11/06 09:35:56 [notice] 17#17: gracefully shutting down
2024/11/06 09:35:56 [notice] 18#18: gracefully shutting down
2024/11/06 09:35:56 [notice] 15#15: gracefully shutting down
2024/11/06 09:35:56 [notice] 17#17: exiting
2024/11/06 09:35:56 [notice] 18#18: exiting
2024/11/06 09:35:56 [notice] 15#15: exiting
2024/11/06 09:35:56 [notice] 16#16: exiting
2024/11/06 09:35:56 [notice] 16#16: exit
2024/11/06 09:35:56 [notice] 15#15: exit
2024/11/06 09:35:56 [notice] 17#17: exit
2024/11/06 09:35:56 [notice] 18#18: exit
I1106 09:35:56.455595 1 event.go:377] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"nginx-ingress", Name:"nginx-inc-ingress-controller", UID:"a61edba5-ec9b-4024-8649-d88d6d932178", APIVersion:"v1", ResourceVersion:"404400560", FieldPath:""}): type: 'Normal' reason: 'Updated' Configuration from nginx-ingress/nginx-inc-ingress-controller was updated
I1106 09:35:56.455660 1 event.go:377] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"<namespace>", Name:"<ingress_name>", UID:"2d658178-96ee-4290-a8cb-a04de49f3150", APIVersion:"networking.k8s.io/v1", ResourceVersion:"381554759", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for <namespace>/<ingress_name> was added or updated
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 17
2024/11/06 09:35:56 [notice] 14#14: worker process 17 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 16
2024/11/06 09:35:56 [notice] 14#14: worker process 16 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 18
2024/11/06 09:35:56 [notice] 14#14: worker process 18 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: worker process 15 exited with code 0
2024/11/06 09:35:56 [notice] 14#14: signal 29 (SIGIO) received
2024/11/06 09:35:56 [notice] 14#14: signal 17 (SIGCHLD) received from 15
E1106 09:39:46.190574 1 processes.go:39] unable to collect process metrics : unable to read file /proc/37/cmdline: open /proc/37/cmdline: no such file or directory
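For reference, the equivalent of what I did from the worker node, expressed as a kubectl command (the label selector and namespace are guesses for our install; PID 14 is the nginx master process from the logs above):

```shell
# Rough reproduction sketch: send SIGHUP to the nginx master process inside the controller Pod.
# Label selector and namespace are assumptions; PID 14 matches the master process in the logs above.
POD=$(kubectl -n nginx-ingress get pods -l app.kubernetes.io/instance=nginx-inc-ingress-controller \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n nginx-ingress exec "$POD" -- sh -c 'kill -HUP 14'
```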
Now the Pod is restarting and going into CrashLoopBackOff because of the busy sockets error. The question is: who is sending signal 1 (SIGHUP) in the production workload?
I also set readOnlyRootFilesystem=false and repeated the same test with signal 1: now the Pod simply restarts and keeps working fine. So it looks like this workaround will work for us.
If you are happy, we will close this for now.
@AlexFenlon No, we want to use this security feature but can't right now because of this bug.
Also, looks like the same problem was reported about a year ago: https://github.com/nginxinc/kubernetes-ingress/issues/4604
Hi @granescb,
Thanks again for bringing this to our attention, we will investigate this again and get back to you.
Hi @granescb, we are looking into this. Are you using a particular type of node / machine OS?
Just noting that this bug was also reported here: https://github.com/nginxinc/kubernetes-ingress/issues/4370
Furthermore, I've hit this issue again with release 3.7.1, because that release increased the memory consumption of the ingress controller Pods, which led to OOM kills, which in turn caused this issue to resurface on Pod restarts.
I'm going to raise a separate issue about the memory consumption, as I don't see that anyone else has reported it yet.
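For anyone checking whether their restarts are OOM kills, the last termination reason of the controller container tells you (namespace and Pod name below are placeholders):

```shell
# Prints the last termination reason of the controller container; "OOMKilled" confirms a memory kill.
# Namespace and Pod name are placeholders.
kubectl -n nginx-ingress get pod <controller-pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```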
Version
3.6.2
What Kubernetes platforms are you running on?
Amazon EKS
Steps to reproduce
k8s EKS version: 1.31
Describe the bug: Sometimes the nginx-ingress-controller restarts the nginx process without cleaning up the socket files. We first hit this problem during a massive node restart in the k8s cluster; since then it happens randomly on weekends.
The problem appeared in version 3.6.2. Before that we ran version 3.0.2 and never saw this problem.
Manually deleting the Pod fixes it, but it can happen again.
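A quick way to check whether stale socket files survive a reload (the /var/lib/nginx path is an assumption based on the socket busy error; adjust the namespace and Pod name):

```shell
# Lists unix sockets inside the controller container; *.sock files surviving a reload
# would explain the "socket busy" crash loop. Path, namespace, and Pod name are assumptions.
kubectl -n nginx-ingress exec <controller-pod> -- sh -c 'ls -l /var/lib/nginx/*.sock'
```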
Here is the deployment YAML:
Logs with error:
Here are the logs, showing the signal 1 (SIGHUP) reconfiguration and then the crash loop with the socket busy error: Explore-logs-2024-11-05 18_40_57.txt
Expected behavior
The nginx-ingress controller Pod keeps running (no CrashLoopBackOff after a reload).