kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Nginx controller with HPA reports 502 error whenever HPA triggers a scale-down. #12157

Closed: ranjith-vatakkeel closed this issue 5 days ago

ranjith-vatakkeel commented 2 weeks ago

We are using the NGINX Ingress chart version v4.11.2 along with an HPA configuration in our production environment. We have observed that whenever the HPA triggers a scale-down operation, many of our applications report 502 errors. Even though the number of failures is very low, we would like to know: is this behaviour expected with HPA?

Approximate number of failures: 600 requests/minute. Total number of requests: 180K requests/minute.

      autoscaling:
        behavior:
          scaleDown:
            policies:
            - periodSeconds: 900
              type: Pods
              value: 1
            stabilizationWindowSeconds: 900
          scaleUp:
            policies:
            - periodSeconds: 60
              type: Pods
              value: 2
            stabilizationWindowSeconds: 120
        enabled: true
        maxReplicas: 25
        minReplicas: 3
        targetCPUUtilizationPercentage: 70
        targetMemoryUtilizationPercentage: {}

What happened:

Observing failure in application logs with 502 error.

What you expected to happen:

There should not be any errors/failures during the NGINX pod scale-down, as we believe NGINX will perform a graceful shutdown of pods whenever a scale-down operation kicks in.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
  Release:       v1.11.2
  Build:         46e76e5916813cfca2a9b0bfdc34b69a0000f6b9
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.5

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4

Environment:

How to reproduce this issue: We don't see this issue in UAT, maybe because of the lower number of requests. I think it's easy to reproduce this issue with a higher number of requests and a standard HPA config, as sketched below.
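
A rough reproduction sketch under those assumptions (the hostname is hypothetical, the load tool hey is my choice rather than anything from this issue, and the rates are only meant to push the HPA past its scale-up threshold):

# 1. Sustain enough load for the HPA to scale the controller up.
hey -z 10m -c 100 -q 20 https://my-app.example.com/

# 2. Once the load stops, keep a low-rate probe running and watch the HPA
#    scale back down; any 502s during the scale-down will show up here.
kubectl -n ingress-nginx get hpa -w &
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://my-app.example.com/)
  [ "$code" = "502" ] && echo "$(date) got 502"
  sleep 1
done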

k8s-ci-robot commented 2 weeks ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

Gacko commented 2 weeks ago

Hello,

is Ingress NGINX fronted by a load balancer? I ask because 502 is a Bad Gateway error, and I assume your load balancer still has an Ingress NGINX Controller pod in its target list while that pod is already being shut down.

We recently had a discussion about this here: https://github.com/kubernetes/ingress-nginx/issues/11890

Even though that issue is about AWS, it might still apply to your setup. Usually it helps to increase the shutdown delay of Ingress NGINX by the amount of time your cloud load balancer takes to completely deregister a target.
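
A minimal sketch of what that could look like in the Helm values (the numbers are purely illustrative, not a recommendation; shutdown-grace-period is the controller flag also used later in this thread, and terminationGracePeriodSeconds has to stay comfortably above it):

controller:
  extraArgs:
    # Keep the pod alive and serving for this long after SIGTERM so the
    # load balancer has time to deregister it (illustrative value).
    shutdown-grace-period: "45"
  # Kubernetes must allow the pod at least this long to terminate.
  terminationGracePeriodSeconds: 300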

Regards Marco

longwuyuan commented 2 weeks ago

/remove-kind bug
/kind support

toredash commented 5 days ago

I'm assuming you're fronting your Nginx pods with an Azure Load Balancer. As already mentioned, you need to ensure that AKS is able to detect that your pod is about to shut down before it actually does.

I'm currently working with something similar in AKS, and we increased the shutdown timer in Nginx to 90s in addition to adjusting the various timeouts in the Azure Load Balancer.

If increasing the shutdown time does not work, let us know.
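
For the load balancer side, the probe timing can, as far as I know, be tuned via Service annotations documented in the cloud-provider-azure load balancer annotations page (linked in a later comment); the annotation names and values below are therefore an assumption to double-check, not something taken from this issue:

controller:
  service:
    annotations:
      # How often the Azure LB probes the health endpoint, and how many
      # failed probes mark a backend unhealthy (illustrative values).
      service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "5"
      service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "2"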

midhun-mohan commented 2 days ago

@toredash How will we ensure that Azure Load Balancer knows that the nginx pod is going to be terminated?

Does nginx have any HTTP endpoints that will tell ALB that the pod is getting killed?

IMO, marking the health endpoint as unhealthy before the actual shutdown procedure is initiated, and then proceeding with the normal shutdown, might help to deregister the targets well before the pod is killed.

(Sidenote: I haven't looked deeper into how closely the health endpoint is tied to the shutdown process.)

toredash commented 2 days ago

> @toredash How will we ensure that Azure Load Balancer knows that the nginx pod is going to be terminated?
>
> Does nginx have any HTTP endpoints that will tell ALB that the pod is getting killed?
>
> IMO, marking the health endpoint as unhealthy before the actual shutdown procedure is initiated, and then proceeding with the normal shutdown, might help to deregister the targets well before the pod is killed.
>
> (Sidenote: I haven't looked deeper into how closely the health endpoint is tied to the shutdown process.)

Once the Pod is instructed to terminate itself, its IP will be removed from the Endpoint list.

Helm chart values that I've set specifically to tackle this problem:

controller:
  extraArgs:
    shutdown-grace-period: "90"

  # Must be larger than the time it takes for an IP address in 
  # the Azure Load Balancer to be considered new and healthy
  minReadySeconds: 30

  service:
    annotations: # https://cloud-provider-azure.sigs.k8s.io/topics/loadbalancer/#loadbalancer-annotations
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
      service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip: "false"
    externalTrafficPolicy: Local # Ensure only nodes running Nginx are allowed to receive traffic

With this, I am able to restart my nginx deployment without causing any downtime.
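
One way to check that kind of zero-downtime behaviour in your own cluster (the deployment name below is the chart default and may differ in your setup) is to keep a low-rate request loop like the one sketched in the reproduction steps running while restarting the controller:

# Trigger a rolling restart of the controller and wait for it to finish,
# while the probe loop keeps counting any non-200 responses.
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller
kubectl -n ingress-nginx rollout status deployment ingress-nginx-controller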