Open norbjd opened 3 weeks ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: norbjd Once this PR has been reviewed and has the lgtm label, please assign davidhadas for approval. For more information see the Kubernetes Code Review Process.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 62.50%. Comparing base (593ddde) to head (c48df14). Report is 1 commit behind head on main.
/retest
Note: I don't know how to write an integration test for this... I've tried many things to kill TCP connections and mimic a flaky network, but couldn't get anything working consistently. I guess network issues happen when we expect them the least :sweat_smile: The closest I've found to "cut" connections is to use network policies, but alas they are not supported on kind (https://github.com/kubernetes-sigs/kind/issues/842), so that won't help in our case.
Putting back in draft because I need to test thoroughly to ensure it works as expected, sorry for the ping!
Changes
/kind enhancement
For context, we are operating kourier "at scale":
In the gateway access logs, we sometimes see client requests ending in 503. These 503s are almost always accompanied by the `UC` response flag, which the Envoy docs describe as an upstream connection termination in addition to the 503 response code. Connections can be terminated for many reasons, but most of the time these are linked to transient TCP failures (connect timeout, reset, disconnect, etc.). As of now, the only way to deal with these 503 UC errors is to retry on the caller side, which is not really convenient, and not always possible, for callers.
In order to increase robustness and handle these specific transient cases, Envoy allows setting retry policies at the `VirtualHost` level and/or the `Route` level. By default, these retry policies are not configured by Kourier, which is why Envoy returns 503s directly to the caller, sadly.
This PR configures the `RetryPolicy` at the `VirtualHost` level (because every VH has multiple routes, it would be cumbersome to define it on every route). The conditions to retry, as explained in the Envoy docs, cover all transient upstream connection failures:
Note 1: BTW, this is also what Istio seems to do by default: https://istio.io/latest/docs/concepts/traffic-management/#retries, https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry, but I'm not really familiar with it, so I can only trust the docs and what I find on the web.
Note 2: It might be tempting to retry on every upstream error (e.g. 5xx), but it is probably not a good idea, as "real" 5xx responses sent by the users' applications (in the ksvc) might not be retriable and could cause more harm if retried. Here, we just focus on TCP connection errors.
Regarding the changes made in the PR itself: I'm pretty sure setting the retry policy for every `VirtualHost` can't be harmful; but, as I don't know your opinion on this, I've hidden it behind an option (`WithRetryOnTransientUpstreamFailure()`, using the option pattern).
For now, the option is always on (`pkg/generator/ingress_translator.go`), but if you prefer, I can easily make it configurable through the Kourier configmap (e.g. if `retry-on-upstream-transient-failures: true` is in the config, call the option; otherwise, bypass it). When adding this option, I have also changed `NewVirtualHostWithExtAuthz`, because I didn't want to add `if`s everywhere, and managing this with options is far more convenient.
So, from there, there are 2 paths I can take:
1. Always set the `RetryPolicy` in the `NewVirtualHost` method (and `NewVirtualHostWithExtAuthz`, by extension)
2. Change the `translateIngress` signature (and all methods calling it) to include the `WithRetryOnTransientUpstreamFailure()` option only if the user has opted in through the Kourier config
Solution 1 is easier to implement but does not guarantee the absence of side effects (just adding retries in case of TCP connection issues should be fine though...), while solution 2 is more configurable and lets retries stay disabled by default.
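
As a rough illustration of the option pattern and of the opt-in path (solution 2), here is a minimal, self-contained sketch. The `VirtualHost` and `RetryPolicy` structs are simplified stand-ins, not the go-control-plane types; the `NewVirtualHost` signature is simplified; `optionsFromConfig` is a hypothetical helper; and the `retry_on` values (`reset,connect-failure`) and retry count are assumptions, not necessarily what this PR sets.

```go
package main

import (
	"fmt"
	"strconv"
)

// RetryPolicy and VirtualHost are simplified stand-ins for the Envoy route
// types; they are NOT the go-control-plane definitions.
type RetryPolicy struct {
	RetryOn    string
	NumRetries uint32
}

type VirtualHost struct {
	Name        string
	Domains     []string
	RetryPolicy *RetryPolicy
}

// VirtualHostOption mutates a VirtualHost while it is being built.
type VirtualHostOption func(*VirtualHost)

// WithRetryOnTransientUpstreamFailure attaches a retry policy limited to
// connection-level failures. The values below are illustrative assumptions.
func WithRetryOnTransientUpstreamFailure() VirtualHostOption {
	return func(vh *VirtualHost) {
		vh.RetryPolicy = &RetryPolicy{RetryOn: "reset,connect-failure", NumRetries: 2}
	}
}

// NewVirtualHost builds a virtual host and applies the options in order.
func NewVirtualHost(name string, domains []string, opts ...VirtualHostOption) *VirtualHost {
	vh := &VirtualHost{Name: name, Domains: domains}
	for _, opt := range opts {
		opt(vh)
	}
	return vh
}

// retryKey is the configmap key suggested in the PR description.
const retryKey = "retry-on-upstream-transient-failures"

// optionsFromConfig (hypothetical) enables the retry option only when the
// configmap data explicitly opts in; a missing or invalid value disables it.
func optionsFromConfig(data map[string]string) []VirtualHostOption {
	var opts []VirtualHostOption
	if enabled, err := strconv.ParseBool(data[retryKey]); err == nil && enabled {
		opts = append(opts, WithRetryOnTransientUpstreamFailure())
	}
	return opts
}

func main() {
	off := NewVirtualHost("a", []string{"a.example.com"}, optionsFromConfig(map[string]string{})...)
	on := NewVirtualHost("b", []string{"b.example.com"}, optionsFromConfig(map[string]string{retryKey: "true"})...)
	fmt.Println(off.RetryPolicy == nil, on.RetryPolicy.RetryOn)
}
```

With solution 1, the constructor would simply always apply the option; with solution 2, only the config-derived option list decides, so retries stay off unless the key is set.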
Tell me what you think. Thanks :pray:
Release Note
Docs
N/A