Closed · ranjith-vatakkeel closed this issue 5 days ago
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Hello,
is Ingress NGINX fronted by a load balancer? A 502 is a Bad Gateway, so I assume your load balancer still has an Ingress NGINX Controller pod in its target list while that pod is already shutting down.
We recently had a discussion about this here: https://github.com/kubernetes/ingress-nginx/issues/11890
Even though that issue is about AWS, it might still apply to your setup. It is usually helpful to increase the shutdown delay of Ingress NGINX by the amount of time your cloud load balancer takes to completely deregister a target.
Regards Marco
/remove-kind bug
/kind support
I'm assuming you're fronting your Nginx pods with an Azure Load Balancer. As already mentioned, you need to ensure that AKS is able to detect that your pod is about to shut down before it actually does.
I'm currently working with something similar in AKS, and we increased the shutdown timer in Nginx to 90s in addition to adjusting the various timeouts in the Azure Load Balancer.
If increasing the shutdown time does not work, let us know.
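To size that shutdown delay, the rule of thumb from the comments above can be sketched in a few lines. This is a hypothetical helper, not part of ingress-nginx; the probe interval and unhealthy threshold below are placeholder values you should replace with your load balancer's actual health-probe settings:

```python
def min_shutdown_grace(probe_interval_s: int, unhealthy_threshold: int,
                       safety_margin_s: int = 30) -> int:
    """Estimate the shutdown-grace-period needed so the load balancer
    deregisters the pod before nginx actually stops serving.

    Worst case, the load balancer needs probe_interval_s * unhealthy_threshold
    seconds to mark a target unhealthy after it stops answering probes;
    add a safety margin on top of that.
    """
    deregistration_s = probe_interval_s * unhealthy_threshold
    return deregistration_s + safety_margin_s

# Example: a 15s probe interval with 2 failed probes plus a 30s margin
print(min_shutdown_grace(15, 2))  # 60
```

If the result exceeds the pod's terminationGracePeriodSeconds, that also needs raising, or the kubelet will kill the pod before the delay elapses.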
@toredash How will we ensure that Azure Load Balancer knows that the nginx pod is going to be terminated?
Does nginx have any HTTP endpoints that will tell ALB that the pod is getting killed?
IMO, if we mark the health endpoint as unhealthy before the actual shutdown procedure is initiated and then proceed with the normal shutdown, it might help deregister the targets well before the pod is killed.
(Side note: I haven't looked deeper into how tightly the health endpoint is tied to the shutdown process.)
Once the Pod is instructed to terminate itself, its IP will be removed from the Endpoint list.
helm chart values that I've set specifically to tackle this problem:

```yaml
controller:
  extraArgs:
    shutdown-grace-period: "90"
  # Must be larger than the time it takes for an IP address in
  # the Azure Load Balancer to be considered new and healthy
  minReadySeconds: 30
  service:
    annotations: # https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
      service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip: "false"
    externalTrafficPolicy: Local # ensure only nodes running an Nginx pod receive traffic
```
With this, I am able to restart my Nginx deployment without causing any downtime.
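Since the reporter installs the chart via a Flux HelmRelease, the same values would sit under spec.values. A hypothetical sketch (namespace, interval, and HelmRepository name are assumptions, adjust to your setup):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: private-ingress
  namespace: ingress-nginx   # assumed namespace
spec:
  interval: 10m              # assumed reconcile interval
  chart:
    spec:
      chart: ingress-nginx
      version: 4.11.2
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx  # assumed repository name
  values:
    controller:
      extraArgs:
        shutdown-grace-period: "90"
      minReadySeconds: 30
      service:
        externalTrafficPolicy: Local
```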
We are using the Nginx Ingress chart version v4.11.2 along with an HPA configuration in our production environment. We have observed that whenever the HPA triggers a scale-down operation, many of our applications report 502 errors. Even though the number of failures is small, we would like to know: is this behaviour expected with HPA?
Approximate number of failures: 600 requests/minute
Total number of requests: 180K requests/minute
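For context, the failure numbers above translate to a small but steady error rate during scale-down, a quick check:

```python
# Error rate implied by the figures reported in this issue
failures_per_min = 600
total_per_min = 180_000
error_rate = failures_per_min / total_per_min
print(f"{error_rate:.2%} of requests fail")  # 0.33% of requests fail
```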
What happened:
Observing failure in application logs with 502 error.
What you expected to happen:
There should not be any errors/failures during the Nginx pod scale-down, as we believe Nginx performs a graceful shutdown of pods whenever a scale-down operation kicks in.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version):
Release: v1.11.2
Build: 46e76e5916813cfca2a9b0bfdc34b69a0000f6b9
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.25.5
Kubernetes version (use kubectl version):
$ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Environment:
Cloud provider or hardware configuration: Azure AKS
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
How was the ingress-nginx-controller installed: using a HelmRelease with Flux
$ helm ls -A | grep -i ingress-nginx
private-ingress   ingress-nginx   11   2024-08-19 10:00:05.820248075 +0000 UTC   deployed   ingress-nginx-4.11.2   1.11.2
Current State of the controller: Everything looks fine from the ingress pod's perspective.
How to reproduce this issue: We don't see this issue in UAT, possibly because of the lower request volume. I think it's easy to reproduce with a higher number of requests and a standard HPA config.