kubernetes-sigs / gateway-api


xRoute backend update guarantees #1870

Open dprotaso opened 1 year ago

dprotaso commented 1 year ago

It would be good to discuss what guarantees implementors can provide during an update of an xRoute. Specifically, in Knative we've seen ingresses drop traffic when transitioning from one backend to another - even when both backends have healthy endpoints.

The situations where I've seen this occur are:

1. Ingress doesn't track backend endpoints that aren't reachable via routes

You have two healthy backends A & B

When the route is updated to point from backend A to backend B, the proxy has to page in backend B's endpoints. This unfortunately races against the route config being rolled out: the ingress drops traffic because backend B's endpoints aren't present yet and the route no longer points to backend A.

2. Newly deployed healthy backends aren't safe for Routes to reference

Even if an ingress solution tracks all endpoints, there's a small window where traffic can be dropped when switching to a newly deployed backend C.

In Knative we have a test where we continuously probe an HTTPRoute while performing the following steps:

  1. Spin up a deployment and service
  2. Wait for a healthy endpoint to appear (by polling the API server)
  3. Update an HTTPRoute to point to the new backend.
  4. Repeat 1-3 about 10x

The race here is that the test observes the new endpoint and updates the HTTPRoute's backend faster than the ingress proxy can process/track that endpoint.
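In rough Go terms, the loop looks something like the sketch below. This is pseudocode, not the actual test: createBackend, waitForHealthyEndpoint, and pointRouteAt are hypothetical helpers standing in for the client-go and Gateway API client calls (the real test is linked later in this thread).

```go
// Hypothetical sketch of the test loop; the helpers are stand-ins for
// client-go / Gateway API client calls, not real functions.
func testBackendSwitch(ctx context.Context) error {
	for i := 0; i < 10; i++ {
		name := fmt.Sprintf("backend-%d", i)

		// 1. Spin up a deployment and service.
		if err := createBackend(ctx, name); err != nil {
			return err
		}
		// 2. Wait for a healthy endpoint to appear (poll the API server).
		if err := waitForHealthyEndpoint(ctx, name); err != nil {
			return err
		}
		// 3. Update the HTTPRoute to point at the new backend. This can win
		// the race against the proxy programming the new endpoints, which
		// is exactly when the continuous probe observes dropped traffic.
		if err := pointRouteAt(ctx, name); err != nil {
			return err
		}
	}
	return nil
}
```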

evankanderson commented 1 year ago

Just to point out: while Knative is using and testing this functionality more aggressively than most implementations, every user of the Gateway API (including tools like https://flagger.app/) will be trying to perform these operations. First, a route pointing at one backend:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: http-filter-1
spec:
  hostnames:
    - my.site.com
  rules:
    - backendRefs:
        - name: my-app-blue
          port: 80
```

then switch the route to a second backend:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: http-filter-1
spec:
  hostnames:
    - my.site.com
  rules:
    - backendRefs:
        - name: my-app-green
          port: 80
```

Note that Knative's networking tests perform this type of configuration change multiple times over a few-minute test while maintaining at least one concurrent request in flight, which may be more sensitive than monitoring that only probes once every few minutes.
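As a concrete illustration, a minimal continuous prober could look like the Go sketch below (the URL, duration, and failure criteria are illustrative; this is not Knative's actual prober):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe sends back-to-back requests against the route and counts any
// transport error or non-200 response as a drop.
func probe(url string, stop <-chan struct{}) (drops, total int) {
	client := &http.Client{Timeout: 5 * time.Second}
	for {
		select {
		case <-stop:
			return drops, total
		default:
		}
		total++
		resp, err := client.Get(url)
		if err != nil || resp.StatusCode != http.StatusOK {
			drops++
		}
		if resp != nil {
			resp.Body.Close()
		}
	}
}

func main() {
	stop := make(chan struct{})
	// Probe for the duration of the test, while the backend switches
	// from the earlier sketch happen elsewhere.
	go func() { time.Sleep(2 * time.Minute); close(stop) }()
	drops, total := probe("http://my.site.com/", stop)
	fmt.Printf("%d/%d requests dropped\n", drops, total)
}
```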

howardjohn commented 1 year ago

I think (1) is valid, but (2) is not, IMO. This actually doesn't even work in Kubernetes core with just basic Service routing. The Endpoint being generated is the signal to kube-proxy to start serving it; just because it's there doesn't mean it's immediately available to consumers.

If your client detects the endpoint and sends traffic to the Service faster than kube-proxy is programmed, you will see dropped traffic.

arkodg commented 1 year ago

A Ready condition at the HTTPRoute level that the client can wait on until Ready=True would solve 1. & 2.

For the case when the endpoint/backend is transitioning from Ready to Terminating, some form of health checking will be needed (passive/outlier detection + retries, or active) to eliminate any traffic loss.
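For illustration, assuming such a Ready condition existed (it is not in the spec today), a client-side wait could look roughly like this, using the generated Gateway API clientset:

```go
package conformance

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayclient "sigs.k8s.io/gateway-api/pkg/client/clientset/versioned"
)

// waitForRouteReady blocks until every parent Gateway reports the
// hypothetical Ready condition as True on the HTTPRoute's status.
func waitForRouteReady(ctx context.Context, gc gatewayclient.Interface, ns, name string) error {
	for {
		route, err := gc.GatewayV1beta1().HTTPRoutes(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		ready := len(route.Status.Parents) > 0
		for _, parent := range route.Status.Parents {
			if !meta.IsStatusConditionTrue(parent.Conditions, "Ready") {
				ready = false
				break
			}
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```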

Sidenote: @dprotaso @evankanderson, are the above tests defined programmatically, so that implementations can run them?

howardjohn commented 1 year ago

(1) should be enforced without any status - changes to a route should be zero downtime, or routes are effectively immutable, IMO.

evankanderson commented 1 year ago

I believe the update test in https://github.com/knative-sandbox/net-gateway-api/tree/main#getting-started should work for that. Knative currently only tests Istio and Contour (and has some gaps for each!), but the code in question for the update test lives here:

https://github.com/knative/networking/blob/main/test/conformance/ingress/update.go

And yes, I'm especially concerned about failure mode 1 (as described in my earlier comment), because the only "workaround" is to shift load via DNS to a different cluster prior to making routing changes, which isn't really scalable.

For (2), does it seem reasonable to block the data-plane routing updates until the Service backends are resolved in the gateway api control plane? This would be more complex, as the control plane would need to define the new backend endpoints and wait for the data plane to complete a health check before updating the routing to point to the new backend.

For newly-created routes, a Ready condition is helpful, but when we're talking about updating existing HTTPRoutes we need to smoothly transition from one valid state to a new valid state without going through a "drop some traffic" state.
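One client-side way to approximate that transition today is sketched below, under the assumption that the implementation honors weighted backendRefs: first add the new backend at weight 0 so its endpoints get programmed while it receives no traffic, then shift the weights. The fixed sleep is a crude stand-in for the readiness signal this issue is asking for, and the backend names come from my earlier example.

```go
package trafficshift

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1beta1 "sigs.k8s.io/gateway-api/apis/v1beta1"
	gatewayclient "sigs.k8s.io/gateway-api/pkg/client/clientset/versioned"
)

func backendRef(name string, weight int32) gatewayv1beta1.HTTPBackendRef {
	port := gatewayv1beta1.PortNumber(80)
	return gatewayv1beta1.HTTPBackendRef{
		BackendRef: gatewayv1beta1.BackendRef{
			BackendObjectReference: gatewayv1beta1.BackendObjectReference{
				Name: gatewayv1beta1.ObjectName(name),
				Port: &port,
			},
			Weight: &weight,
		},
	}
}

// shiftBlueToGreen performs a two-phase backend switch instead of an
// atomic swap, so both backends' endpoints stay programmed during the
// transition. Assumes the route has a single rule, as in the example.
func shiftBlueToGreen(ctx context.Context, gc gatewayclient.Interface, ns, name string) error {
	route, err := gc.GatewayV1beta1().HTTPRoutes(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Phase 1: introduce green at weight 0 - its endpoints get paged in,
	// but it receives no traffic yet.
	route.Spec.Rules[0].BackendRefs = []gatewayv1beta1.HTTPBackendRef{
		backendRef("my-app-blue", 100),
		backendRef("my-app-green", 0),
	}
	if route, err = gc.GatewayV1beta1().HTTPRoutes(ns).Update(ctx, route, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Crude stand-in for the readiness signal this thread is asking for.
	time.Sleep(10 * time.Second)

	// Phase 2: shift all traffic to green.
	route.Spec.Rules[0].BackendRefs = []gatewayv1beta1.HTTPBackendRef{
		backendRef("my-app-green", 100),
	}
	_, err = gc.GatewayV1beta1().HTTPRoutes(ns).Update(ctx, route, metav1.UpdateOptions{})
	return err
}
```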

howardjohn commented 1 year ago

> For (2), does it seem reasonable to block the data-plane routing updates until the Service backends are resolved in the gateway api control plane? This would be more complex, as the control plane would need to define the new backend endpoints and wait for the data plane to complete a health check before updating the routing to point to the new backend.

In a vacuum, sort of. I think it's hard to do, or at least it would be in Istio, which is what I am familiar with, and at least harder than it sounds.

We don't want to block all updates, just updates to that specific route. So that means either the control or data plane needs to keep both copies. We also need to know when we are in the "waiting for backend" state, because I'd imagine we don't want to wait forever (or do we? even on a new route, do we not apply it until all backends are healthy?). So we need to be fairly stateful. This is plausible, but may not be robust - we aren't (realistically) going to become truly stateful, so if the control plane restarts during this operation we may suddenly flip to the unhealthy backend; perhaps that's an edge case we can live with.

To handle it, we also need a mechanism to know "is the backend ready" that is more robust than the k8s state and is actually based on the underlying data plane. That is possible to determine in general with Envoy, but becomes tricky in practice. Do we need all instances of the data plane to be ready, or only some of them? We may also have multiple control plane replicas, so now we need to share this state between control plane instances. What if we are a multi-cluster mesh - do we need all control planes to share with all other control planes and ensure 100% of all data plane instances across the globe have the change before we shift? Can a single Envoy instance being down (in the global view) block all updates?

robscott commented 1 year ago

For triage:

/assign @arkodg

arkodg commented 1 year ago

@dprotaso would a conformance test for 1) be a good next step?

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

    • After 90d of inactivity, lifecycle/stale is applied
    • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
    • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

    • Mark this issue as fresh with /remove-lifecycle stale
    • Close this issue with /close
    • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

evankanderson commented 1 year ago

/remove-lifecycle stale

dprotaso commented 1 year ago

/lifecycle frozen

arkodg commented 11 months ago

I've raised https://github.com/kubernetes-sigs/gateway-api/issues/2303 to solve 1., which IMO is a test implementations should strive to pass.

  1. boils down to: what information should someone rely on to ensure that the network policy has been programmed into the data plane, so that there are no traffic drops?
    • we can't rely on querying the Kube API server, because at that point we cannot guarantee that the implementation has received this and programmed it into the data plane
    • we can't rely on the Programmed status condition in the HTTPRoute, because we cannot guarantee that the route has actually been programmed in the data plane
    • a new status condition such as Ready, implemented by the controller, could in theory be used to guarantee that the routes in the data plane are ready, but this is a hard problem to solve for most implementations

Instead of focusing on what information a client needs before it starts sending requests in order to guarantee no traffic drops, we can instead come up with a max duration within which a controller should program routes.

Here are the example steps provided in the original description:

  1. Spin up a deployment and service
  2. Wait for a healthy endpoint to appear (by polling the API server)
  3. Update an HTTPRoute to point to the new backend.
  4. Repeat 1-3 about 10x

A conformance test can be written that waits for some duration (let's say 1s) between steps 3 and 4, but expects no failures at step 4.
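Roughly, such a test could be shaped like the sketch below (updateRouteToNewBackend and probeOnce are hypothetical helpers; the 1s budget is the value up for discussion):

```go
// Hypothetical sketch of the proposed conformance test: after each route
// update, the controller gets a fixed programming budget, after which no
// dropped traffic is tolerated.
func testProgrammingDeadline(t *testing.T, url string) {
	for i := 0; i < 10; i++ {
		updateRouteToNewBackend(t, i) // steps 1-3 from the list above
		time.Sleep(1 * time.Second)   // max duration to program the route
		if err := probeOnce(url); err != nil { // step 4 must see no drops
			t.Fatalf("traffic dropped after the programming deadline: %v", err)
		}
	}
}
```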

Hoping to hear from others regarding this. At this point all conformance tests focus on correctness rather than performance, and this one would fall under the performance bucket IMO.

youngnick commented 11 months ago

I'm pretty reluctant to add any performance-based conformance test before GA; I think this one in particular is very easy to get wrong, both from the conformance and the implementation sides. I also don't know if we want to mandate zero-packet-loss backend changes - that seems like it could be a high bar for some implementations. Maybe when we do add performance-style testing, we extend the testing result YAML to include the numbers (packets lost, in this case)?

Again, I don't think we can move on @arkodg's suggestion before GA; sadly, there are too many sharp edges.