ingress nginx controller keeps routing to old endpoint resulting in intermittent timeouts

vchan2002 commented 4 months ago

What happened:

We see intermittent "upstream timed out" errors when nginx tries to talk to its downstream service. In those errors, the downstream URL it tries to use is always the same, even after recycling the downstream service/pods. So it seems that nginx stubbornly keeps forwarding requests to an old pod that was likely terminated during a deployment.

84787 upstream timed out (110: Operation timed out) while connecting to upstream, where ${URL} is always the same.

The only way to make nginx "forget" that old upstream URL is to drain/delete the node that the previous pod/IP address was assigned to.
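To back up the claim that the timeouts always hit the same upstream, the controller's error log can be tallied per upstream address. The following is a minimal sketch, assuming the standard nginx error-log layout in which the target appears as `upstream: "http://IP:PORT/..."`. The function name and the sample lines (addresses, host names, request IDs) are hypothetical and not taken from this cluster.

```python
import re
from collections import Counter

# Captures the address from the `upstream: "scheme://addr/..."` field of an
# nginx "upstream timed out" error line. Assumes the standard error-log layout.
UPSTREAM_RE = re.compile(r'upstream timed out .*?upstream: "\w+://(?P<addr>[^/"]+)')


def count_timed_out_upstreams(lines):
    """Return a Counter mapping upstream address -> number of timeout errors."""
    counts = Counter()
    for line in lines:
        match = UPSTREAM_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    '2024/07/05 10:00:01 [error] 84787#84787: *101 upstream timed out '
    '(110: Operation timed out) while connecting to upstream, client: 10.0.0.5, '
    'server: app.example.com, request: "GET /health HTTP/1.1", '
    'upstream: "http://10.0.1.23:8080/health", host: "app.example.com"',
    '2024/07/05 10:03:17 [error] 84787#84787: *245 upstream timed out '
    '(110: Operation timed out) while connecting to upstream, client: 10.0.0.9, '
    'server: app.example.com, request: "GET /api HTTP/1.1", '
    'upstream: "http://10.0.1.23:8080/api", host: "app.example.com"',
    '2024/07/05 10:04:00 [notice] 84787#84787: signal process started',
]

if __name__ == "__main__":
    for addr, hits in count_timed_out_upstreams(SAMPLE_LOG).most_common():
        print(f"{addr}\t{hits}")
```

If a single address dominates the tally and that address no longer belongs to a running pod, that supports the stale-endpoint observation.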

What you expected to happen:

After a new deployment behind an ingress-nginx ingress, we expect the nginx controller to learn what the new downstreams are and reconfigure itself accordingly.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version):

- NGINX Ingress controller
- Release: v1.9.1
- Build: 3538107c077f1bd860d448e19f44fc8e6a2729e1
- Repository: https://github.com/kubernetes/ingress-nginx
- nginx version: nginx/1.21.6

Kubernetes version (use kubectl version):

v1.28.9-eks-036c24b

Environment:

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
- Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
Basic cluster related info:
- kubectl version
- kubectl get nodes -o wide
How was the ingress-nginx-controller installed (Helm release):

- Name: ingress-nginx
- Namespace: kube-system
- Revision: 34
- Updated: 2024-06-17 15:43:59.284338504 +0000 UTC
- Status: deployed
- Chart: ingress-nginx-4.10.1
- App version: 1.10.1

Current State of the controller:

- Name: nginx
- Labels:
  - app.kubernetes.io/component=controller
  - app.kubernetes.io/instance=ingress-nginx
  - app.kubernetes.io/managed-by=Helm
  - app.kubernetes.io/name=ingress-nginx
  - app.kubernetes.io/part-of=ingress-nginx
  - app.kubernetes.io/version=1.10.1
  - helm.sh/chart=ingress-nginx-4.10.1
- Annotations:
  - ingressclass.kubernetes.io/is-default-class: true
  - meta.helm.sh/release-name: ingress-nginx
  - meta.helm.sh/release-namespace: kube-system
- Controller: k8s.io/ingress-nginx
- Events:

k8s-ci-robot commented 4 months ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

longwuyuan commented 4 months ago

@vchan2002 the info you provided cannot be analyzed. Your issue description can only be noted as your observation.

- Click the button to create a new bug report and look at the questions asked in the template there.
- Edit this issue description and provide the answers to those questions. That will be data that readers can analyze.
- Ensure that your issue description is formatted as per markdown.

If the kubelet of a node does not update the api-server about a pod going away, then the controller also cannot update its own endpointslice.
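One way to gather data on this point is to check whether the timed-out upstream address is still listed as a ready endpoint in the Service's EndpointSlices. The sketch below assumes the JSON comes from something like `kubectl get endpointslices -l kubernetes.io/service-name=<service> -o json` (a `discovery.k8s.io/v1` list) and that addresses are IPv4. The helper names and the sample data are hypothetical.

```python
def ready_addresses(endpoint_slice_list):
    """Collect addresses of endpoints that are ready and not terminating."""
    addresses = set()
    for item in endpoint_slice_list.get("items", []):
        for endpoint in item.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            # An unset `ready` condition is treated as ready here;
            # endpoints flagged as terminating are skipped.
            is_ready = conditions.get("ready") is not False
            is_terminating = bool(conditions.get("terminating"))
            if is_ready and not is_terminating:
                addresses.update(endpoint.get("addresses", []))
    return addresses


def is_stale(upstream, endpoint_slice_list):
    """True if the upstream's IP is not among the ready endpoint addresses.

    `upstream` is an "IP:PORT" string as it appears in the nginx error log
    (IPv4 assumed for this sketch).
    """
    ip = upstream.rsplit(":", 1)[0]
    return ip not in ready_addresses(endpoint_slice_list)


# Hypothetical EndpointSliceList, trimmed to the fields used above.
SAMPLE_SLICES = {
    "items": [
        {
            "endpoints": [
                {"addresses": ["10.0.2.40"], "conditions": {"ready": True}},
                {
                    "addresses": ["10.0.1.23"],
                    "conditions": {"ready": False, "terminating": True},
                },
            ]
        }
    ]
}

if __name__ == "__main__":
    for upstream in ("10.0.1.23:8080", "10.0.2.40:8080"):
        print(upstream, "stale" if is_stale(upstream, SAMPLE_SLICES) else "current")
```

If the address from the error log is still listed as ready, the stale data sits in the API rather than in the controller. If it is absent but nginx keeps using it, the controller's view is the part that is out of date.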

/remove-kind bug
/kind support
/triage needs-informtion

k8s-ci-robot commented 4 months ago

@longwuyuan: The label(s) triage/needs-informtion cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/11562#issuecomment-2211583708):

> @vchan2002 the info you provided can not be analyzed. Your issue description can just be noted as your observation.
>
> - Click the button to create a new bug report and look at the questions asked in the template there.
> - Edit this issue description and provide the answers to those questions. That will be data that readers can analyze.
> - Ensure that your issue description is formatted as per markdown.
>
> If the kubelet of a node does not update the api-server about a pod going away, then the controller also can not update its own endpointslice.
>
> /remove-kind bug
> /kind support
> /triage needs-informtion

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

longwuyuan commented 4 months ago

/triage needs-information

vchan2002 commented 4 months ago

So, while I am trying to gather some info, I do have to ask:

What could cause the kubelet to not report that a pod went away?

This has happened in one specific environment of ours, on one specific ingress, twice in the past two weeks, so it is not a one-off. It just seems very odd that this keeps happening.

github-actions[bot] commented 3 months ago

This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any questions or want to request prioritization, please reach out in #ingress-nginx-dev on Kubernetes Slack.
