solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Sporadic 503s in EKS when not under load #5952

Open nfuden opened 2 years ago

nfuden commented 2 years ago

Gloo Edge Version

1.10.x (latest stable)

Kubernetes Version

No response

Describe the bug

When Gloo Edge is deployed in EKS without appreciable load (about 1 request per second), the gateway-proxy fails some requests to upstreams with a 503.

Trace logs show that this is when

Steps to reproduce the bug

  1. Get an EKS cluster
  2. Start sending 1 request per second through the gateway-proxy
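The low-rate traffic in step 2 could be generated with something like the following. This is only a sketch: the `probe` and `fetch_status` helpers are hypothetical names, and the URL in the usage note is a placeholder for whatever route is exposed through the gateway-proxy. It sends one request per interval and tallies the status codes so that sporadic 503s stand out.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code of a GET to url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def probe(url, count, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Send `count` requests, one every `interval` seconds, and tally statuses."""
    tally = Counter()
    for i in range(count):
        tally[fetch(url)] += 1
        if i < count - 1:
            sleep(interval)
    return tally
```

For example, `print(probe("https://<gateway-address>/", 600))` would run for roughly ten minutes at 1 request per second. The probe only sees the status returned by the proxy; the Envoy debug logs are still needed to see what happened on the failing connection.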

Expected Behavior

Following the steps above, you will get some 503s and see something like this in the gateway-proxy logs:

[2022-02-18 12:27:02.423][30][debug][misc] [external/envoy/source/common/network/io_socket_error_impl.cc:64] Unknown error code 32 details Broken pipe
[2022-02-18 12:27:02.423][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:304] [C249] SSL shutdown: rc=-1
[2022-02-18 12:27:02.423][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:217] [C249] TLS error: 33554536:system library:OPENSSL_internal:Connection reset by peer 33554464:system library:OPENSSL_internal:Broken pipe
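The log output labels error code 32 as "unknown", but on Linux that number is the standard errno for a write to a connection the peer has already closed. A small sketch for decoding it (the `describe_errno` helper is a hypothetical name, not part of Envoy or Gloo):

```python
import errno
import os


def describe_errno(code):
    """Map a numeric errno to its symbolic name and the OS error message."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(32)
print(name, message)
```

On Linux, 32 maps to `EPIPE`, which matches the "Broken pipe" detail in the log and is consistent with the proxy writing to a connection that had already been reset by the peer.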

Additional Context

The right answer may be to always advise setting up maxRetries, as in https://github.com/solo-io/gloo/issues/4798, but it would be good to understand whether there is something else we should be doing as well.
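To illustrate why a retry limit would mask this kind of sporadic failure, here is a minimal client-side sketch of bounded retries on 503. It is not Gloo configuration and does not reproduce the settings discussed in the linked issue; `call_with_retries`, `send`, and the parameter names are assumptions made for the example.

```python
def call_with_retries(send, max_retries=3, retry_statuses=(503,)):
    """Call send() until it returns a non-retryable status or retries run out.

    Returns (final_status, total_attempts).
    """
    attempts = 0
    while True:
        status = send()
        attempts += 1
        # Stop on success, or once the initial try plus max_retries are spent.
        if status not in retry_statuses or attempts > max_retries:
            return status, attempts
```

If the 503 comes from a single stale connection, the first retry goes out on a fresh connection and succeeds, so the caller never sees the failure. That hides the symptom without explaining why the connection was reset, which is the open question here.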

nfuden commented 2 years ago

This may be resolved by upgrading to 1.10.6.

An interesting note for research: this may be related to AWS Lambda usage without the forced credential refresh timer.

github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.