knative-extensions / net-kourier

Purpose-built Knative Ingress implementation using just Envoy with no additional CRDs
Apache License 2.0

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

Open norbjd opened 3 weeks ago

norbjd commented 3 weeks ago

Changes

/kind enhancement


For context, we are operating Kourier "at scale".

In the gateway access logs, we sometimes see client requests ending in 503s. These 503s are almost always accompanied by the UC response flag, which means (from the Envoy docs):

Upstream connection termination in addition to 503 response code.

Connections can be terminated for many reasons, but most of the time, these are linked to transient TCP failures (connect timeout, reset, disconnect, etc.). As of now, the only way to deal with these 503 UC errors is to retry on the caller side, which is not really convenient - and not always possible - for callers.

In order to increase robustness and handle these specific transient cases, Envoy allows setting retry policies at the VirtualHost level and/or the Route level. Kourier does not configure these retry policies by default, which is why Envoy sadly returns 503s straight to the caller.

This PR configures the RetryPolicy at the VirtualHost level (since every VH has multiple routes, it would be cumbersome to define it on every route). The conditions to retry, as explained in the Envoy docs, cover all transient upstream connection failures:

  • reset: Envoy will attempt a retry if the upstream server does not respond at all (disconnect/reset/read timeout).
  • connect-failure: Envoy will attempt a retry if a request failed because of a connection failure to the upstream server (connect timeout, etc.). [...] NOTE: A connection failure/timeout is at the TCP level, not the request level
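To make the intent concrete, here is a sketch of what the resulting Envoy route configuration could look like with the policy applied at the virtual-host level. The host name and cluster name are hypothetical, and `num_retries` is shown only for illustration (Envoy defaults to 1):

```yaml
virtual_hosts:
- name: hello-example
  domains: ["hello.default.example.com"]
  # Retry only TCP-level failures, never application 5xx responses.
  retry_policy:
    retry_on: "reset,connect-failure"
    num_retries: 3
  routes:
  - match:
      prefix: "/"
    route:
      cluster: hello-cluster
```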

Note 1: BTW, this also seems to be what Istio does by default: https://istio.io/latest/docs/concepts/traffic-management/#retries, https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry. I'm not really familiar with Istio, though, so I can only trust the docs and what I find on the web.

Note 2: It might be tempting to retry on every upstream error (e.g. 5xx), but that is probably not a good idea, as "real" 5xx responses sent by users' applications (in the ksvc) might not be retriable and could cause more harm if retried. Here, we focus only on TCP connection errors.


Regarding the changes made in the PR itself: I'm pretty sure setting the retry policy on every VirtualHost can't be harmful; but, as I don't know your opinion on this, I've hidden it behind an option (WithRetryOnTransientUpstreamFailure(), using the options pattern).

For now, the option is always on (pkg/generator/ingress_translator.go), but if you prefer, I can easily make it configurable through the Kourier configmap (e.g. if retry-on-upstream-transient-failures: true is set in the config, call the option; otherwise, bypass it). When adding this option, I also changed NewVirtualHostWithExtAuthz, because I didn't want to add ifs everywhere; managing this with options is far more convenient.

So, from there, there are 2 paths I can take:

  1. configure the retry by default: in that case, I can completely remove the "options" pattern and just configure the RetryPolicy in the NewVirtualHost method (and NewVirtualHostWithExtAuthz, by extension)
  2. make the retry configurable by the user: in that case, I will keep the "options" pattern, but I will need to change the translateIngress signature (and all methods calling it) to include the WithRetryOnTransientUpstreamFailure() option only if the user has opted in via the Kourier config

Solution 1 is easier to implement but offers no way to opt out if there are unexpected side effects (though just adding retries on TCP connection issues should be fine...), while solution 2 is more configurable and lets retries stay disabled by default.

Tell me what you think. Thanks :pray:

Release Note

Add retry policy on every virtual host for upstream transient connection issues

Docs

N/A

knative-prow[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbjd Once this PR has been reviewed and has the lgtm label, please assign davidhadas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/knative-extensions/net-kourier/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.50%. Comparing base (593ddde) to head (c48df14). Report is 1 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   62.31%   62.50%   +0.19%
==========================================
  Files          24       24
  Lines        1632     1635       +3
==========================================
+ Hits         1017     1022       +5
+ Misses        553      552       -1
+ Partials       62       61       -1
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

norbjd commented 3 weeks ago

/retest

norbjd commented 3 weeks ago

Note: I don't know how to write an integration test for this... I've tried many things to kill TCP connections and mimic a flaky network, but I couldn't get anything working consistently. I guess network issues happen when we expect them the least :sweat_smile: The closest I've found to "cut" connections is to use network policies, but alas they are not supported on kind (https://github.com/kubernetes-sigs/kind/issues/842), so that won't help in our case.

norbjd commented 3 weeks ago

Putting back in draft because I need to test thoroughly to ensure it works as expected, sorry for the ping!