solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

503 errors with UH flags after rolling update of the Kubernetes nodes #5449

Open djannot opened 3 years ago

djannot commented 3 years ago

Describe the bug
A rolling update was performed to change the AMI of the nodes, and there were no issues during the update itself. The Kubernetes nodes where the gateway-proxy Pods are running were upgraded first.

After the operation completed, everything was working well, but 503 errors (with the UH flag) started to occur several minutes later. The 503 errors appear to have started after a specific CDS update that was applied to all the nodes but one.
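
For context, the UH flag in the Envoy access log means "no healthy upstream": at the moment of the request, every host in the target cluster was marked unhealthy (here, presumably by the active HTTP health check configured on the Upstream). A rough sketch of how to confirm this from the proxy's admin interface follows; the gloo-system namespace, gateway-proxy deployment name, and admin port 19000 are the Gloo Edge defaults and may differ in this environment:

# Forward the Envoy admin port of one gateway-proxy pod.
kubectl -n gloo-system port-forward deploy/gateway-proxy 19000:19000 &

# Show the per-host health flags for every cluster; hosts failing the
# active HTTP health check are marked "health_flags::/failed_active_hc".
curl -s localhost:19000/clusters | grep health_flags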

There's only one main Upstream with the following configuration:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  labels:
    ***
  name: ***
  namespace: ***
spec:
  circuitBreakers:
    maxConnections: 10240
    maxPendingRequests: 10240
    maxRequests: 10240
    maxRetries: 3
  failover:
    prioritizedLocalities:
    - localityEndpoints:
      - lbEndpoints:
        - address: ***
          port: 15443
          upstreamSslConfig:
            secretRef:
              name: failover-upstream
              namespace: gloo-system
            sni: ***
        locality:
          region: ***
          zone: ***
      - lbEndpoints:
        - address: ***
          port: 15443
          upstreamSslConfig:
            secretRef:
              name: failover-upstream
              namespace: gloo-system
            sni: ***
        locality:
          region: ***
          zone: ***
  healthChecks:
  - healthyThreshold: 2
    httpHealthCheck:
      path: /health/***
    interval: 2s
    timeout: 2s
    unhealthyThreshold: 2
  static:
    hosts:
    - addr: ***
      port: 80
    useTls: false
  useHttp2: false
status:
  reportedBy: gloo
  state: 1

After restarting all the Gloo Edge Pods, the issue disappeared.
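
Since restarting the pods cleared the problem, a useful data point if it recurs would be to capture the rendered Envoy configuration from an affected proxy before restarting it, and diff that against the proxy that kept serving 200. A sketch, assuming the default admin port 19000 is reachable via a port-forward (the pod names below are placeholders):

# Capture the full Envoy config_dump from one of the affected pods.
kubectl -n gloo-system port-forward pod/<affected-gateway-proxy-pod> 19000:19000 &
curl -s localhost:19000/config_dump > affected-config.json

# Repeat against the healthy pod, then diff the two, paying attention to
# the cluster generated for the Upstream above (its endpoints,
# health-check settings, and failover priorities).
diff <(jq -S . affected-config.json) <(jq -S . healthy-config.json)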

To Reproduce
The customer hasn't been able to reproduce the issue.


djannot commented 3 years ago

Here are some additional details provided by the customer:

we observed CDS updates every 6-8 minutes; a couple of them went through just fine, but a specific one that was applied to 5 out of 6 pods at the same time caused the upstream to go unhealthy (all requests resulted in 503 after the update); 1 pod did not see this update, and that pod happily continued to serve 200

all pods received 13 CDS updates

5 pods received an update at 10:04:12; all of these pods started returning 503 after that update, and 2 of them eventually recovered and started serving 200 again without any apparent relation to a CDS update

1 pod did not receive an update at 10:04:12; however, it received an update at 10:03:41 which did not affect the upstream health check, and that pod always returned 200

only the CDS update at 10:04:12 affected the upstream; subsequent updates did not change the state (pods continued serving 503 or 200)
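
One way to gather these per-pod CDS counts is from Envoy's cluster_manager statistics. A sketch, assuming the gloo=gateway-proxy label and gloo-system namespace (the Gloo Edge defaults) and a short-lived port-forward per pod:

# Compare CDS update counters across the gateway-proxy pods; the pod
# whose counters diverge should stand out.
for pod in $(kubectl -n gloo-system get pods -l gloo=gateway-proxy -o name); do
  echo "== ${pod}"
  kubectl -n gloo-system port-forward "${pod}" 19000:19000 >/dev/null &
  pf_pid=$!
  sleep 2   # give the port-forward a moment to establish
  curl -s localhost:19000/stats | grep -E 'cluster_manager\.cds\.(version|update_success|update_rejected)'
  kill "${pf_pid}"
done
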
bcollard commented 3 years ago

looks similar to https://github.com/solo-io/gloo/issues/5442

github-actions[bot] commented 4 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.