kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Long worker-shutdown-timeout causes long-lived connections to fail #11515

Open James-Quigley opened 3 days ago

James-Quigley commented 3 days ago

What happened:

The setup:

What you expected to happen: I would expect the workers to gracefully try to terminate the long-lived connections, and to continue routing to valid targets in the meantime.

Instead, I find that gRPC clients start to get UNAVAILABLE or UNIMPLEMENTED errors, which I presume comes from traffic being routed by the old workers to IPs that no longer exist (or have been assigned to different pods).

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): v1.8.1

Kubernetes version (use kubectl version): 1.27

Environment: EKS

ingress-nginx:
  controller:
    image:
      registry: registry.k8s.io
    priorityClassName: cluster-core-services
    kind: Deployment

autoscaling:
  minReplicas: 9
  maxReplicas: 35
  targetCPUUtilizationPercentage: 65

maxUnavailable: 2
resources:
  requests:
    cpu: 3000m
    memory: 4500Mi

# the reload pod needs to see the nginx pid in the controller pod, so enable sharing of pids
shareProcessNamespace: true

# Name of the ingress class to route through this controller
ingressClassResource:
  name: https-internal
  default: true
  controllerValue: k8s.io/ingress-https-internal

# Process IngressClass per name
ingressClassByName: true

# Use the dedicated ingress nodes
nodeSelector:
  dedicated: ingress

# Tolerate taints on dedicated ingress nodes.
tolerations:
  - key: dedicated
    value: ingress
    operator: Equal
    effect: NoSchedule

service:
  annotations:
    # Use a (TCP) Network Load Balancer.
    # https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true,deregistration_delay.connection_termination.enabled=true,deregistration_delay.timeout_seconds=120
    # "Internal" load balancing is enabled by default.

  externalTrafficPolicy: Local

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: ingress-https-internal
    scrapeInterval: 15s

## Additional command line arguments to pass to nginx-ingress-controller
## E.g. to specify the default SSL certificate you can use
## extraArgs:
##   default-ssl-certificate: "<namespace>/<secret_name>"
extraArgs:
  ingress-class: https-internal
  # Instructs the controller to wait this many seconds before sending the quit signal to nginx.
  # After receiving the quit signal, nginx will attempt to complete any requests before reaching the
  # terminating grace period.
  shutdown-grace-period: 360

autoscaling:
  enabled: true
  minReplicas: 6
  maxReplicas: 25
  targetCPUUtilizationPercentage: 85
  targetMemoryUtilizationPercentage: null
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 180
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60

# see per-env values for resource configuration

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: ingress-https-internal

maxUnavailable: 1
terminationGracePeriodSeconds: 2100  # 35 mins

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - ingress-https-internal
          topologyKey: kubernetes.io/hostname

allowSnippetAnnotations: true
config:
  use-forwarded-headers: true
  worker-shutdown-timeout: 30m
  enable-underscores-in-headers: true

admissionWebhooks:
  patch:
    image:
      registry: registry.k8s.io



**How to reproduce this issue**:
Still working on an easy way to reproduce this

**Anything else we need to know**:
https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua#L297-L317
This seems to sync the list of endpoints periodically, based on the configuration that gets handled by the Lua HTTP server (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/configuration.lua#L245-L248).
That value is sent from the Go controller here: https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/controller/nginx.go#L944-L987
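
Roughly, I understand the hand-off like this (a simplified sketch with assumed names, not the actual configuration.lua/balancer.lua code): the Go controller POSTs the backend list to the Lua HTTP server, which stores it in a shared dict, and each worker's periodic sync reads it back from there.

```lua
-- Simplified sketch of the configuration/balancer hand-off (dict and key
-- names are assumptions, not the real ingress-nginx code).
-- lua_shared_dict "configuration_data" is assumed to be declared in nginx.conf.
local configuration_data = ngx.shared.configuration_data

-- Called by the Lua HTTP server when the Go controller POSTs the backends.
local function handle_backends()
  ngx.req.read_body()
  local backends = ngx.req.get_body_data()
  if not backends then
    ngx.status = ngx.HTTP_BAD_REQUEST
    return
  end
  local ok, err = configuration_data:set("backends", backends)
  if not ok then
    ngx.log(ngx.ERR, "failed to store backends: ", err)
    ngx.status = ngx.HTTP_INTERNAL_SERVER_ERROR
    return
  end
  ngx.status = ngx.HTTP_CREATED
end

-- Called from each worker's periodic sync timer to refresh its local copy.
local function get_backends()
  return configuration_data:get("backends")
end
```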

The timer created in the worker init, though, follows the ngx.timer.every semantics documented here:
https://github.com/openresty/lua-nginx-module?tab=readme-ov-file#ngxtimerevery

> timer will be created every delay seconds until the current Nginx worker process starts exiting
> timer expiration happens when the Nginx worker process is trying to shut down, as in an Nginx configuration reload triggered by the HUP signal or in an Nginx server shutdown

Therefore I believe that if:
- worker-shutdown-timeout is long,
- an nginx reload is triggered,
- the backing pod IPs change during the worker-shutdown-timeout window (and therefore the endpoints change),
- and a client is holding open a connection and sending requests,

then those requests will be routed to invalid targets, because the draining worker will no longer be getting an updated list of endpoints (see the sketch below).
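
As a rough illustration of why the draining worker stops updating, here is a minimal sketch of the periodic sync as an ngx.timer.every callback (names and interval are assumptions, not the actual balancer.lua code). Once the worker starts exiting, no further timers are created and a pending timer fires with premature set, so the worker keeps balancing against the endpoint list it had at reload time:

```lua
-- Minimal sketch of the periodic endpoint sync (names and interval assumed,
-- not the actual ingress-nginx code).

-- Placeholder for whatever refreshes this worker's endpoint list from the
-- shared configuration pushed by the Go controller.
local function sync_backends()
  -- ... read the latest backends and rebuild the local balancer state ...
end

local function sync(premature)
  if premature then
    -- The worker is shutting down (e.g. after a HUP-triggered reload).
    -- No further syncs will run in this worker, so it keeps routing to
    -- the endpoints it knew about at reload time.
    return
  end
  sync_backends()
end

local _M = {}

function _M.init_worker()
  -- Per the lua-nginx-module docs, ngx.timer.every stops creating new
  -- timers once the current worker process starts exiting.
  local ok, err = ngx.timer.every(1, sync)
  if not ok then
    ngx.log(ngx.ERR, "error creating endpoint sync timer: ", err)
  end
end

return _M
```

If that reading is right, any endpoint change that happens after the reload stays invisible to the draining worker for the whole worker-shutdown-timeout window, which would match the UNAVAILABLE/UNIMPLEMENTED errors described above.
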
k8s-ci-robot commented 3 days ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
James-Quigley commented 3 days ago

I'm also happy to share some extra evidence privately. But I was able to:

The log volume shows a massive drop-off in requests right as the rollout happens (which I would expect). However, a small tail of requests continues to the old pods, despite the fact that they no longer exist (or at least aren't used by the app in question anymore).

longwuyuan commented 3 days ago

/remove-kind bug
/triage needs-information

I think that the user here, https://github.com/kubernetes/ingress-nginx/issues/11508, also reports the same behavior.

While we wait for other comments and maybe some expert opinion, I have made an assumption that, in a multi-replica use case, the routing and load balancing of traffic will be the same regardless of whether the backend protocol from the controller to the backend is gRPC or HTTP. Let me know if my assumption is wrong. It will take me too long to dive into the code and ascertain this.

So I tested with a 5-replica deployment using the image nginx:alpine, generating at least one new connection per second across multiple sessions to that deployment's HTTP backend-protocol ingress.

I could not reproduce the problem indicated in these two issues. I did see a transition-related response code, though.

So I am inclined to think that the client making these off-beat connection requests needs to handle the transitions better. It seems such clients take a myopic view: they neither use affinity/persistence nor accommodate transition-related events like a graceful drain of connections followed by routing to another backend pod.

In any case, it will help a lot to get a step-by-step guide to reproduce the problem.