James-Quigley opened 3 days ago
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
I'm also happy to share some extra evidence privately. But I was able to:
The log volume shows a massive drop-off in requests right as the rollout happens (which I would expect). However, a small tail of requests continues to reach the old pods, even though they no longer exist (or at least are no longer used by the app in question).
/remove-kind bug
/triage needs-information
I think that the user here, https://github.com/kubernetes/ingress-nginx/issues/11508, also reports the same behavior.
While we wait for other comments, and maybe some expert opinion, I assumed that in a multi-replica use case the routing and load balancing of traffic from the controller to the backend is the same regardless of whether the backend-protocol is gRPC or HTTP. Let me know if my assumption is wrong; it would take me too long to dive into the code and verify this.
So I tested with a 5-replica deployment using the image nginx:alpine, generating at least one new connection per second across multiple sessions, against that deployment's HTTP backend-protocol ingress.
I could not reproduce the problem indicated in these two issues, though I did see a transition-related response code.
So I am inclined to think that clients connecting at awkward moments need to handle the transitions better. Such clients seem to have a myopic view: they neither use affinity/persistence nor accommodate transition events like graceful draining of connections and rerouting to another backend pod.
In any case, it will help a lot to get a step-by-step guide to reproduce the problem.
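To illustrate the client-side handling suggested above, here is a minimal Python sketch. It is not the reporter's client: `call_with_retry` and `flaky_call` are hypothetical names, and a plain `RuntimeError` carrying the string "UNAVAILABLE" stands in for a real gRPC status so the example stays self-contained. The idea is simply to retry with exponential backoff when a request lands on a draining backend.

```python
import time

UNAVAILABLE = "UNAVAILABLE"  # stand-in for a real gRPC UNAVAILABLE status


def call_with_retry(call, max_attempts=5, base_delay=0.1):
    """Retry a unary call that may hit a draining backend mid-rollout.

    `call` is any zero-argument function that either returns a result or
    raises RuntimeError(UNAVAILABLE) while old endpoints are being removed.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError as err:
            if str(err) != UNAVAILABLE or attempt == max_attempts - 1:
                raise
            # Back off before retrying; a fresh attempt should be routed
            # to a live endpoint once the controller's config catches up.
            time.sleep(base_delay * (2 ** attempt))


# Simulated rollout: the first two attempts hit a drained pod, then succeed.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RuntimeError(UNAVAILABLE)
    return "ok"


print(call_with_retry(flaky_call, base_delay=0.01))  # prints "ok"
```

Real gRPC clients can get equivalent behavior from built-in retry policies or wait-for-ready semantics rather than hand-rolled loops; the sketch only shows the shape of the handling.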
What happened:
The setup:
What you expected to happen: I would expect the workers to gracefully try to terminate the long lived connections, and to continue to route to valid targets in the meantime.
Instead, I find that gRPC clients start to get UNAVAILABLE or UNIMPLEMENTED errors, which I presume come from traffic being routed by the old workers to IPs that no longer exist (or have been assigned to different pods).
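One mitigation commonly used for this symptom (offered as a sketch under assumptions, not as the reporter's configuration) is to delay backend pod shutdown with a preStop hook, so the controller has time to drop the endpoint before the pod stops accepting connections. The container name `my-grpc-app` and the sleep duration are placeholders:

```yaml
# Hypothetical backend Deployment fragment: hold the pod open briefly
# after it is marked Terminating so in-flight and newly routed requests
# can drain before the container receives SIGTERM.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: my-grpc-app
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]
```

This does not fix stale routing in the controller itself, but it narrows the window in which requests can reach an address that no longer backs the Service.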
NGINX Ingress controller version (exec into the pod and run `nginx-ingress-controller --version`): v1.8.1
Kubernetes version (use `kubectl version`): 1.27
Environment: EKS
Cloud provider or hardware configuration: AWS EKS
OS (e.g. from /etc/os-release): Bottlerocket
Kernel (e.g. `uname -a`): 5.15.160
Install tools: AWS EKS
Basic cluster related info: v1.27
How was the ingress-nginx-controller installed:
```yaml
ingress-nginx:
  controller:
    image:
      registry: registry.k8s.io
    priorityClassName: cluster-core-services
    kind: Deployment
    admissionWebhooks:
      patch:
        image:
          registry: registry.k8s.io
```