solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Gloo gateway-proxy issue during zero-downtime upgrades #9036

Open Piotreqsl opened 6 months ago

Piotreqsl commented 6 months ago

Gloo Edge Product

Enterprise

Gloo Edge Version

v1.19.3

Kubernetes Version

v1.27.8

Describe the bug

We have a problem with gloo gateway-proxy pods during our cluster upgrades.

Our application handles many WebSocket connections and must ensure each of them is served properly. We've implemented custom PreStop hooks in our pods, and they work fine during normal upgrades (e.g. changing the Docker tag) or HPA scaling: the pod waits until the last WebSocket connection is terminated or until a specified timeout is reached.

The problem occurs when we try to upgrade the whole cluster (e.g. change the EC2 instance type). We observed that the gateway-proxy pods are killed during the upgrade, and as a result the connections to our application pods are terminated.

We've read the https://docs.solo.io/gloo-edge/latest/operations/advanced/zero-downtime-gateway-rollout/ documentation, which mentions health checks, but we're not sure how that would help us keep long-lived connections alive (our pods are in the Terminating state).

We'd kindly ask for advice on how to configure the Gloo gateway-proxy so that it does not kill existing connections.

Expected Behavior

Gloo gateway-proxy should wait for the last connection to terminate, or time out after a specified period.
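To make the expected behavior concrete, here is a minimal sketch of the "wait for the last connection or time out" logic, similar in spirit to what our PreStop hooks do. It is not Gloo's or Envoy's implementation; the `drain` function and the `active_connections` callback are hypothetical names used only for illustration, and how the connection count is actually obtained is left open.

```python
import time
from typing import Callable


def drain(
    active_connections: Callable[[], int],  # hypothetical callback returning the open-connection count
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until no connections remain or timeout_s elapses.

    Returns True if every connection closed on its own,
    False if the timeout was reached first.
    """
    deadline = clock() + timeout_s
    while True:
        if active_connections() == 0:
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

In a shutdown hook, the process would call `drain(...)` before exiting and only then let the pod terminate. The timeout would have to be shorter than the pod's termination grace period, otherwise the kubelet kills the container first.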

Steps to reproduce the bug

  1. Open a long-lived WebSocket connection to an application pod (the connection must go through the Gloo proxy).
  2. Run kubectl rollout restart on deployment/gateway-proxy in the gloo-system namespace. New gateway-proxy pods should spin up, and the old ones should go into the Terminating state.
  3. At the moment the old gateway-proxy pod is killed, the connection to the application is killed as well.

Additional Environment Detail

No response

Additional Context

No response

soloio-bot commented 6 months ago

Zendesk ticket #3076 has been linked to this issue.

github-actions[bot] commented 1 week ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.