kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

proxy-next-upstream (including default on error and timeout) does not always pick a different upstream depending on load balancer and concurrent requests #11852

Open marvin-roesch opened 2 weeks ago

marvin-roesch commented 2 weeks ago

What happened: When one backend pod fails under a condition covered by proxy_next_upstream (e.g. http_404 for easy testing) and there is a large volume of requests, a single request may be retried against the same backend for all of its attempts instead of actually moving on to the "next" backend. This reliably happens with the default round-robin balancer, but most likely affects all balancer implementations.

What you expected to happen: If a backend request fails due to one of the proxy_next_upstream conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.11.2

Kubernetes version (use kubectl version): 1.28.10

Environment:

How to reproduce this issue:

Install minikube/kind

Install the ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/baremetal/deploy.yaml

Install an application with at least 2 pods that will always respond with status 404

echo '
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: next-upstream-repro
    template:
      metadata:
        labels:
          app: next-upstream-repro
      spec:
        containers:
        - image: nginx
          imagePullPolicy: IfNotPresent
          name: nginx
          ports:
          - containerPort: 80
          volumeMounts:
            - name: conf
              mountPath: /etc/nginx/conf.d
        volumes:
          - name: conf
            configMap:
              name: next-upstream-repro
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: next-upstream-repro
    namespace: default
  spec:
    ports:
      - name: http
        port: 80
        targetPort: 80
        protocol: TCP
    type: ClusterIP
    selector:
      app: next-upstream-repro
  ---
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: next-upstream-repro
    namespace: default
  data:
    default.conf: |
      server {
        listen       80;
        server_name  localhost;

        location = / {
          return 404 "$hostname\n";
        }
      }
' | kubectl apply -f -

Create an ingress which tries next upstream on 404

echo "
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: next-upstream-repro
    annotations:
      nginx.ingress.kubernetes.io/proxy-next-upstream: 'error http_404 timeout'
  spec:
    ingressClassName: nginx
    rules:
    - host: foo.bar
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: next-upstream-repro
              port:
                name: http
" | kubectl apply -f -

Make many requests in parallel

POD_NAME=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o name)
kubectl exec -it -n ingress-nginx $POD_NAME -- bash -c "seq 1 200 | xargs -I{} -n1 -P10 curl -H 'Host: foo.bar' localhost"

Observe in the ingress controller's access logs (kubectl logs -n ingress-nginx $POD_NAME) that many requests repeat the same upstream address in $upstream_addr, e.g.

::1 - - [23/Aug/2024:08:49:42 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.000 [default-next-upstream-repro-http] [] 10.1.254.92:80, 10.1.254.92:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 afa1e1e8964286bd7d1b7664f606bb2f
::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6

Anything else we need to know: The problem is exacerbated when only a few backend pods (2 in the repro case) are hit by a large request volume concurrently. There is essentially a conflict between global load balancing behaviour and per-request retries at play here. With the default round-robin load balancer, for example, the balancer instance for a particular backend is shared by all requests (on a given nginx worker).

Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of information can be as follows:

  1. Request 1 reaches ingress nginx, gets routed to endpoint A by round robin balancer, waits for response from backend
  2. Round robin balancer state: Next endpoint is endpoint B
  3. Request 2 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  4. Round robin balancer state: Next endpoint is endpoint A
  5. Response from endpoint A fails for request 1, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  6. Round robin balancer state: Next endpoint is endpoint B
  7. Request 3 reaches ingress nginx, gets routed to endpoint B by round robin balancer, waits for response from backend
  8. Round robin balancer state: Next endpoint is endpoint A
  9. Response from endpoint B fails for request 2, proxy_next_upstream config requests another endpoint from the load balancing system, it gets routed to endpoint A by round robin balancer
  10. Responses from all endpoints for request 1, 2, and 3 succeed

As you can see, request 1 is only ever handled by endpoint A despite the proxy_next_upstream directive. Depending on the actual rate and ordering of requests, request 2 could have met a similar fate, but request 3 arrived before request 2's initial response failed, so it happens to work out in that case.
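To make the shared state explicit, here is a stripped-down Lua sketch (illustrative only, not the actual ingress-nginx implementation) of a round-robin balancer whose index lives in a single table shared by every request an nginx worker handles; two balance() calls made on behalf of the same request can return the same endpoint whenever other requests advance the index in between:

-- Illustrative only: a minimal round-robin balancer with worker-shared state.
local RoundRobin = { endpoints = { "endpoint A", "endpoint B" }, index = 0 }

function RoundRobin:balance()
  -- Every call advances the shared index, regardless of which request asks.
  self.index = self.index % #self.endpoints + 1
  return self.endpoints[self.index]
end

-- Replaying the flow above:
print(RoundRobin:balance()) -- request 1, first attempt       -> endpoint A
print(RoundRobin:balance()) -- request 2, first attempt       -> endpoint B
print(RoundRobin:balance()) -- request 1, retry after failure -> endpoint A again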

This makes proxy-next-upstream extremely unreliable and causes it to behave in unexpected ways. One approach to fixing it would be to make the Lua-based load balancing aware of which endpoints a request has already tried. The exact semantics are hard to nail down, however, since this might break the guarantees that some load balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging requests over a failure scenario: a backend endpoint might become unreachable, which should eventually result in it being removed from load balancing once probes have caught up to that fact. In the meantime, the default "error timeout" strategy should try the "next" available upstream for any request hitting that endpoint, but if everything aligns just right, the load balancer keeps returning the same endpoint, resulting in a 502 despite the system at large being perfectly capable of handling the request.
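One possible shape of such a fix, sketched here purely as an illustration (the names tried_endpoints and pick_untried_endpoint are hypothetical, not the controller's actual API), would be to record attempted endpoints in ngx.ctx, which persists across the balancer_by_lua invocations of a single request, and skip them when picking the next peer:

-- Hypothetical sketch only; not the actual balancer.lua implementation.
local ngx_balancer = require("ngx.balancer")

local function pick_untried_endpoint(lb)
  -- ngx.ctx is per-request and shared across retries of the same request.
  local tried = ngx.ctx.tried_endpoints or {}
  ngx.ctx.tried_endpoints = tried

  -- Ask the shared balancer repeatedly until it yields an endpoint this
  -- request has not used yet, bounded by the number of known endpoints.
  local endpoint
  for _ = 1, #lb.endpoints do
    endpoint = lb:balance()
    if not tried[endpoint] then
      break
    end
  end
  tried[endpoint] = true
  return endpoint
end

local function balance(lb)
  local endpoint = pick_untried_endpoint(lb)
  local ip, port = endpoint:match("^(.*):(%d+)$")
  local ok, err = ngx_balancer.set_current_peer(ip, tonumber(port))
  if not ok then
    ngx.log(ngx.ERR, "error while setting current upstream peer: ", tostring(err))
  end
end

How an already-tried set should interact with strategies like EWMA or consistent hashing is exactly the semantic question raised above, so this sketch only covers the round-robin-style case.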

k8s-ci-robot commented 2 weeks ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 2 weeks ago

/remove-kind bug
/kind support
/triage needs-information

I understand why your reproduction example chooses to have pods with nginx configured to return 404 on location /.

But that cannot be considered a real-world use case, as real-world workloads don't have all pods configured to return 404.

If you want, you can change the test to a real-world use case where the pod or pods first return 200, then introduce an event that produces 4XX or 5XX, and so on and so forth. But unless you can post the data, like kubectl describe outputs, kubectl logs outputs, curl outputs and responses, etc., others will have to make efforts over and beyond the normal to even triage this issue.

longwuyuan commented 2 weeks ago

/remove-kind bug

marvin-roesch commented 2 weeks ago

@longwuyuan While I agree the reproduction example is a bit contrived and not particularly reflective of any real-world use case, it is the easiest way to reliably reproduce this issue without overcomplicating the test case. The sporadic nature of this issue is why I opted for such a simplistic approach to reproducing it. If the backend service behaves reliably at all (so that proxy_next_upstream never has to do anything), the probability of encountering the issue drops drastically. It doesn't particularly matter that the final response is still a 404; the access logs I have included clearly demonstrate the issue.

To point you more directly at where the issue lies, as can be seen from my reproduction example, note this access log line that one of my curls produced in the ingress controller's nginx:

::1 - - [23/Aug/2024:08:53:21 +0000] "GET / HTTP/1.1" 404 1 "-" "curl/8.9.0" 70 0.001 [default-next-upstream-repro-http] [] 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80 0, 0, 1 0.000, 0.000, 0.000 404, 404, 404 b753b1828cc200d3c95d6ecbc6ba80e6

As you can see, the $upstream_addr value is 10.1.254.93:80, 10.1.254.93:80, 10.1.254.93:80, so the same endpoint gets used three times in a row despite the proxy_next_upstream config. I have omitted the surrounding access log lines, which look similar (just with a few different $upstream_addr values), since they only add noise given the fairly random nature of this issue.

I have amended the command for getting the logs from the ingress controller and will happily provide more information, but I think the example I have provided is the minimal reproducible one. The problem happens entirely within ingress-nginx and applies to any error case that proxy_next_upstream can handle; a 404 is just much simpler to produce than a connection error.

longwuyuan commented 2 weeks ago

OK, I am on a learning curve here, so please help out with some questions.

Replicas is 2 and both are triggering a lookup for the next upstream. Do I have to reproduce this on my own to figure out what happens when at least one replica returns 200 instead of 404? Does that one not get picked?

marvin-roesch commented 2 weeks ago

If any of the picked upstreams returns a non-error response, nginx behaves as expected and ends the retry chain there. Since another attempt is only made for a given request when there is an error matching the proxy_next_upstream config, the problem lies solely with how the next upstream gets picked by the Lua load balancing implementation that ships with and is used by the ingress controller by default (https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer.lua is the entry point for this).

The default template configures proxy_next_upstream for up to 3 attempts, which is where the 3 occurrences in $upstream_addr in the access logs come from. Leaving everything else at the default as well (i.e. using round-robin load balancing), manually firing an occasional request will not show the same upstream being used twice in a row, because no concurrent requests advance the global load balancer state between the attempts. That's why my repro uses a command that performs many requests in parallel.

ZJfans commented 2 weeks ago

In nginx, fail_timeout and max_fails remove a failed backend for a certain period of time, but the Lua balancer does not have this capability.

proxy_next_upstream_tries 3;

upstream test {
    server 127.0.0.1:8080 fail_timeout=30s max_fails=1; # Server A
    server 127.0.0.1:8081 fail_timeout=30s max_fails=1; # Server B
}

If A fails once, it will be removed from rotation for 30 seconds.
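For comparison, a rough Lua-side sketch of similar semantics (all names made up for illustration, state tracked per nginx worker) could temporarily exclude an endpoint once it has failed often enough:

-- Illustration only: fail_timeout / max_fails style tracking per nginx worker.
local MAX_FAILS = 1
local FAIL_TIMEOUT = 30 -- seconds

local failures = {} -- endpoint -> { count = n, banned_until = timestamp }

local function report_failure(endpoint)
  local f = failures[endpoint] or { count = 0, banned_until = 0 }
  f.count = f.count + 1
  if f.count >= MAX_FAILS then
    f.banned_until = ngx.now() + FAIL_TIMEOUT
    f.count = 0
  end
  failures[endpoint] = f
end

local function is_available(endpoint)
  local f = failures[endpoint]
  return f == nil or ngx.now() >= f.banned_until
end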