projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

Gateway API: implement proportional 503 responses when invalid backend refs #4125

Open skriss opened 3 years ago

skriss commented 3 years ago

ref. https://gateway-api.sigs.k8s.io/v1alpha2/references/spec/#gateway.networking.k8s.io/v1alpha2.HTTPRouteRule (backendRefs field):

If unspecified or invalid (refers to a non-existent resource or a Service with no endpoints), the rule performs no forwarding. If there are also no filters specified that would result in a response being sent, a HTTP 503 status code is returned. 503 responses must be sent so that the overall weight is respected; if an invalid backend is requested to have 80% of requests, then 80% of requests must get a 503 instead.

We currently handle configuring a 503 when there are no valid backends, but we do not implement proportional 503 responses when only some of the backends are invalid.
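
For illustration, a hypothetical HTTPRoute expressing the split described in the quoted spec text. The names (`example-split`, `contour-gateway`, `foo`, `bar`) are made up, and `bar` is assumed to be missing or to have no ready endpoints:

```yaml
# Hypothetical example: 80% of the weight references a backend that does not
# exist (or has no endpoints), so per the spec 80% of requests should get a
# 503 rather than being shifted onto "foo".
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: HTTPRoute
metadata:
  name: example-split
spec:
  parentRefs:
  - name: contour-gateway
  rules:
  - backendRefs:
    - name: foo      # valid Service with ready endpoints
      port: 80
      weight: 20
    - name: bar      # invalid: Service missing or has no endpoints
      port: 80
      weight: 80
```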

xaleeks commented 3 years ago

Maybe I’m misunderstanding but the following doesn’t make sense to me.

if an invalid backend is requested to have 80% of requests, then 80% of requests must get a 503 instead.

It should be 100% of the requests going to that backend generating a 503, no?

Do we need to bring HTTPProxy into parity with HTTPRoute as well? Let's say for some route we designate 20% to service foo and 80% to service bar, and service bar's endpoints are down, so we only generate 503s for the requests that flow to service bar. Is the current behavior such that these 503s are being skipped entirely due to the partial availability of service foo?

stevesloka commented 3 years ago

It's a matter of how you see the world. I'd expect to try and serve as many requests as possible, so I'd opt for only sending traffic to the foo service, skipping the errors with the bar service, and setting a condition to tell the user what's happening.

When this was designed upstream in HTTPRoute, others didn't feel the same way; their concern was that you'd overwhelm the other service, which is only expecting 20% of the traffic.

If you did nothing (i.e. no logic at all), then you'd get what you're looking for. Any down service or bad endpoint will get a 503 from Envoy.

youngnick commented 3 years ago

Yeah, I think this is most important if we think about it the other way - if you are bringing in a new service version at 1% of traffic, and it's broken, you want the 1% of traffic that should be sent to that service to fail, rather than silently succeeding just because the new version would have produced an error. I understand that it's nice to be able to save people from an inadvertent error, but in the case you're canary testing, having the same failure rate if the service is completely gone is quite important, I think.

stevesloka commented 3 years ago

I understand that it's nice to be able to save people from an inadvertent error, but in the case you're canary testing, having the same failure rate if the service is completely gone is quite important, I think.

I struggle with that because, as a user, I don't care that your system is down, I want my requests to work! =)

If a k8s node went unhealthy, should we still try to schedule to it so you can see the errors? Why have health checks from an external load balancer to a set of nodes? To me it's sort of the same thing.

There's also a difference between requests going to a backend and failing, which Contour can't know up front, and knowing that all the endpoints of a service are bad. That case Contour can and does know about, but the API spec says to still route traffic even though there's no chance of the requests succeeding.
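
One way the proportional behavior could surface in the Envoy config (a sketch only, not necessarily how Contour would generate it, and the cluster names are illustrative): keep the invalid backend's weight in the weighted-cluster split but point it at a cluster that has no endpoints, so Envoy itself answers that share of requests with a 503 (no healthy upstream) instead of shifting them to the healthy backend.

```yaml
# Sketch of an Envoy route entry; names are illustrative, not Contour's
# actual cluster naming scheme.
route:
  weighted_clusters:
    clusters:
    - name: default/foo/80     # healthy cluster backing service foo
      weight: 20
    - name: default/bar/80     # cluster left with no endpoints; Envoy
      weight: 80               # returns 503 "no healthy upstream" for it
```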

youngnick commented 3 years ago

I understand that the end user won't like that, but in this case, the user for Contour of this feature is the application owner, not the end user, and they will care about ensuring that they know that a canary has failed. It won't be any good for the user if the application owner doesn't realise that their service is broken, changes weights to one hundred percent, and all of a sudden has one hundred percent failure, when there was one hundred percent success just before, because we were trying to be helpful.

xaleeks commented 3 years ago

Fair concern that it would overwhelm the healthy service, since the services were presumably deployed and sized for the traffic split they expect in the first place.

Also a fair point that, as the cluster admin, I'd like to be notified right away (via 503s, for example) that one of my endpoints is down, instead of traffic being silently rerouted. But then as the end user, I really don't care that your backend is down; I expect resiliency and elasticity in your infrastructure, and only that my request is served.

I think for this ticket, we need to:

  1. Implement the Gateway API as intended, or bring up any disagreements with the upstream
  2. Attempt consistency with how HTTPProxy implements this behavior (see the sketch after this list)
  3. Document differences between the two approaches if there are any
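
For comparison, a minimal HTTPProxy expressing the same 20/80 split (the names and fqdn are hypothetical); this is the shape of the configuration whose behavior item 2 asks to keep consistent:

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: example-split
spec:
  virtualhost:
    fqdn: example.local
  routes:
  - services:
    - name: foo     # healthy Service
      port: 80
      weight: 20
    - name: bar     # assume this Service is missing or has no endpoints
      port: 80
      weight: 80
```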

stevesloka commented 3 years ago

This is why Envoy has active & passive health checking, to look for issues and reroute based upon how those endpoints are behaving (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#arch-overview-outlier-detection).
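
A minimal sketch of the kind of Envoy cluster configuration this refers to, combining passive (outlier detection) and active (HTTP health check) checking; the names, paths, and thresholds are illustrative only:

```yaml
clusters:
- name: example-backend
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  outlier_detection:            # passive: eject hosts that keep returning 5xx
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 50
  health_checks:                # active: probe each host periodically
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz
  load_assignment:
    cluster_name: example-backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.default.svc.cluster.local
              port_value: 80
```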

Same issue as https://github.com/projectcontour/contour/issues/1119

youngnick commented 3 years ago

In any event, for us to be able to meet the Gateway API contract for v1alpha2, we have to implement this. I've updated #1119 with more thoughts on the subject, but for this issue, I don't think there's much to argue: we either need to implement the spec as it is, or urgently update it upstream.

skriss commented 2 years ago

This will probably become a conformance issue at some point, but for now I'm removing it from the 1.22 milestone; still open to a contributor picking this up, though!

github-actions[bot] commented 1 year ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

Please send feedback to the #contour channel in the Kubernetes Slack