solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Client Load Balancing in ExtAuth server #5962

Open jmunozro opened 2 years ago

jmunozro commented 2 years ago

Version

1.10.x (latest stable)

Is your feature request related to a problem? Please describe.

Currently, if you configure passthrough auth with, say, 10 replicas of the passthrough server, the extauth server picks one of them and stays attached to it, so the solution does not scale well.

This is a small poc to demonstrate: poc.zip

https://user-images.githubusercontent.com/35881711/155167769-703c4325-5427-4ad6-81df-d81f615b1f04.mov

Describe the solution you'd like

I would like to be able to scale the passthrough servers horizontally, as they can become busy and die one by one under heavy load.

Describe alternatives you've considered

No response

Additional Context

https://techdozo.dev/grpc-load-balancing-on-kubernetes-using-headless-service/
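For reference, the approach in the linked article is to point the gRPC client at a headless Service and enable client-side round-robin balancing. A minimal sketch of what that dial could look like (the service name, namespace, and port here are made up, not Gloo's actual configuration):

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
    )

    func main() {
        // "dns:///" makes the gRPC resolver return every address behind the name;
        // with a headless Service that is one address per passthrough Pod.
        conn, err := grpc.Dial(
            "dns:///passthrough-grpc.gloo-system.svc.cluster.local:9001",
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            // Spread RPCs across all resolved addresses instead of pinning to one.
            grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
        )
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // conn would then back the passthrough auth client stub.
    }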

nfuden commented 2 years ago

It seems the best way forward here is to first allow the load-balancing policy of the passthrough target to be configured, then make an opinionated decision about the passthrough address format based on that policy, and then expose those options.

elcasteel commented 2 years ago

I think a solution to this issue might affect what we're seeing in https://github.com/solo-io/gloo/issues/6518, although they're not exactly the same thing.

jmunozro commented 2 years ago

A possible solution suggested by @jenshu: as outlined here, configuring `MAX_CONNECTION_AGE` will close the connection after some time and allow the client to re-resolve the destination.

I've tested it with my little server and a low value (1 second). It worked well with 4 replicas.

    // Close each connection after 1 second so clients have to reconnect
    // and re-resolve, spreading load across the passthrough replicas.
    opts := []grpc.ServerOption{
        grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge: 1 * time.Second,
        }),
    }
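For completeness, a minimal (illustrative) way to wire those options into a server, assuming the usual net/log imports and the standard grpc-go setup:

    lis, err := net.Listen("tcp", ":9001")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    srv := grpc.NewServer(opts...) // opts from the snippet above
    // Register the passthrough auth service implementation here, then:
    if err := srv.Serve(lis); err != nil {
        log.Fatalf("serve: %v", err)
    }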

Here you can see the results of my test:

https://user-images.githubusercontent.com/35881711/186878414-f45c53a7-6627-4ec2-8d59-7015607e4050.mov

EItanya commented 2 years ago

So this is an interesting one, and I'm honestly not sure what the right answer is. Setting a max connection age cuts the connection, but then you lose the benefits of long-lived gRPC connections. IMO the correct solution here is probably a more complicated client-balancing algorithm that opens connections to all pods, but that is obviously more work. Would love @nrjpoddar's opinion on this as well.

nrjpoddar commented 2 years ago

There's no good answer here. In Istio, for Envoy xDS, we used a 30-minute timeout to force connection re-establishment, which in turn forces a new LB decision. @EItanya, even if you opened connections to all instances, I'm not sure how that would help.

EItanya commented 2 years ago

I guess I was imagining some system by which the client would have a pool of connections it could use to reach the passthrough auth server, and would pick one based on some algorithm/heuristic.

nfuden commented 1 year ago

Yeah, I think what I was trying to say previously without sufficient context is that if we don't want to use keepalive, we can implement a client-side balancing algorithm and set it here: https://github.com/solo-io/ext-auth-service/blob/master/pkg/config/passthrough/grpc/grpc_client_manager.go#L61-L66. Unfortunately this would be hard to keep in check between multiple solo extauth pods, but we could stop using round robin and instead pass in a shared sync state that tracks the set of known connections. We could clean it up here: https://github.com/solo-io/ext-auth-service/blob/master/pkg/config/passthrough/grpc/grpc_client_manager.go#L76-L84. It would likely be fine to write something that naively balances all gRPC connections from extauth to its passthroughs, but that doesn't account for the possibility of those services having other connections from outside gloo/extauth; and if we did account for that, either the logic gets much more complex and fragile or the negotiation time gets longer, both of which are bad for things on the data plane.
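To make the shared-state idea concrete, here is a rough sketch (not the real grpc_client_manager code; every name below is made up) of a pool of known connections shared between callers and picked round-robin:

    package passthroughpool // hypothetical package

    import (
        "sync"

        "google.golang.org/grpc"
    )

    // connPool holds one client connection per known passthrough endpoint and
    // hands them out round-robin. The pool (or its state) would have to be
    // shared/synchronized if multiple extauth pods were meant to cooperate.
    type connPool struct {
        mu    sync.Mutex
        conns []*grpc.ClientConn
        next  int
    }

    // pick returns the next connection in round-robin order, or nil if none exist.
    func (p *connPool) pick() *grpc.ClientConn {
        p.mu.Lock()
        defer p.mu.Unlock()
        if len(p.conns) == 0 {
            return nil
        }
        c := p.conns[p.next%len(p.conns)]
        p.next++
        return c
    }

    // closeAll tears down every tracked connection, mirroring the cleanup that
    // would happen when a passthrough config is removed.
    func (p *connPool) closeAll() {
        p.mu.Lock()
        defer p.mu.Unlock()
        for _, c := range p.conns {
            _ = c.Close()
        }
        p.conns = nil
    }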

nfuden commented 1 year ago

Per a Slack thread, there was a correct comment that this may be configured with a Kubernetes Service, which handles the load balancing by default. Given this, and the desire to keep long-running gRPC connections to reduce overhead (and therefore latency), the proposed first pass is to implement the 30-minute timeout and come back to this later with a more interesting balancing strategy.
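If that first pass reuses the keepalive approach from earlier in the thread, it presumably becomes the same server option with a longer age (a sketch; the exact values are illustrative):

    opts := []grpc.ServerOption{
        grpc.KeepaliveParams(keepalive.ServerParameters{
            MaxConnectionAge:      30 * time.Minute, // force periodic reconnect and re-resolution
            MaxConnectionAgeGrace: 1 * time.Minute,  // let in-flight RPCs drain first
        }),
    }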

djannot commented 8 months ago

No code change is required. You just need to create a Kubernetes headless service to make sure the extauth server gets the IP addresses of the passthrough Pods.
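To illustrate why the headless Service matters (service name hypothetical, usual net/log/fmt imports assumed): a regular ClusterIP Service resolves to a single virtual IP, so the gRPC client opens one connection and sticks with whichever Pod it lands on, while a headless Service (clusterIP: None) resolves to every ready Pod IP, which is what lets a round_robin client spread load.

    // With a headless Service this prints one IP per passthrough Pod;
    // with a normal ClusterIP Service it prints a single virtual IP.
    addrs, err := net.LookupHost("passthrough-grpc.gloo-system.svc.cluster.local")
    if err != nil {
        log.Fatalf("lookup: %v", err)
    }
    fmt.Println(addrs)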

github-actions[bot] commented 2 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.