grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0

`IDLE_TIMEOUT` not implemented, alternatives? #5498

Closed: GTB3NW closed this issue 2 years ago

GTB3NW commented 2 years ago

Hi, I've read the connectivity semantics doc, and a few GitHub issues mention that IDLE_TIMEOUT is not implemented in Go and that keep-alives will be used instead to keep these connections alive.

I was hoping to be able to see when a connection goes into the idle state, both for circumstances where the server tells it to go away and when an RPC or stream has not been used for a configurable duration.

The decision not to implement this seems to go against the spec, and I can't actually see an alternative design other than maybe interceptors, but that feels like a messy alternative.

I'm implementing a client connection cache which will auto-evict idle, failed, and intentionally closed connections. Is this something that can only be implemented on the server side, and if so, how?

dfawley commented 2 years ago

Hi, I've read the connectivity semantics doc, and a few GitHub issues mention that IDLE_TIMEOUT is not implemented in Go and that keep-alives will be used instead to keep these connections alive.

IDLE_TIMEOUT is not used for keeping connections alive, so this is a little confusing. Server-side, keepalive features can configure a MaxConnectionIdle setting to close connections to clients when they are not in use. This is somewhat similar to client-side IDLE_TIMEOUT, which is a feature that makes the client go completely dormant (close all connections including name resolver / load balancing, if applicable) when it is not in use.
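For illustration, here is a minimal sketch (not from the thread) of the MaxConnectionIdle setting described above; the 5-minute value, listen address, and service registration are placeholder assumptions:

```go
// A minimal sketch of the server-side MaxConnectionIdle keepalive setting.
// The 5-minute value and the listen address are arbitrary assumptions.
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	srv := grpc.NewServer(
		// Close connections that have had no active RPCs for 5 minutes.
		// The server sends a GOAWAY and the client's subchannel goes IDLE.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle: 5 * time.Minute,
		}),
	)

	// Register services here, e.g. pb.RegisterFooServer(srv, &fooServer{}).

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve error: %v", err)
	}
}
```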

I was hoping to be able to see when a connection goes into the idle state, both for circumstances where the server tells it to go away and when an RPC or stream has not been used for a configurable duration.

When a server sends a GOAWAY, the client's subchannel connected to that server will transition to IDLE. If your client is using round_robin load balancing (not the default), it will just reconnect immediately. The default load balancer (pick_first) will not attempt to reconnect until an RPC is sent, or Connect (experimental) is called. Calling ClientConn.GetState() will return connectivity.Idle once this happens.
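As a rough sketch of what watching for that IDLE transition might look like on the client side (the target address and insecure credentials are assumptions for illustration):

```go
// A minimal sketch of observing a ClientConn's connectivity state using
// GetState, WaitForStateChange, and Connect (all experimental APIs).
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func watchConn(ctx context.Context, cc *grpc.ClientConn) {
	state := cc.GetState()
	for {
		// Block until the state changes (or ctx is done).
		if !cc.WaitForStateChange(ctx, state) {
			return // context cancelled
		}
		state = cc.GetState()
		log.Printf("connection state: %v", state)
		if state == connectivity.Idle {
			// e.g. evict the connection from a cache here, or force a
			// reconnect with cc.Connect().
		}
	}
}

func main() {
	cc, err := grpc.Dial("example.internal:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer cc.Close()

	go watchConn(context.Background(), cc)
	// ... issue RPCs as usual; block here so the watcher keeps running ...
	select {}
}
```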

The decision not to implement this seems to go against the spec, and I can't actually see an alternative design other than maybe interceptors, but that feels like a messy alternative.

We do intend to implement IDLE_TIMEOUT eventually, but it hasn't been considered a high enough priority for the team to spend time on it so far.

I'm implementing a client connection cache which will auto-evict idle, failed, and intentionally closed connections. Is this something that can only be implemented on the server side, and if so, how?

Can you explain more about what you're trying to do with your cache and how your deployment looks from a high level? Are you creating multiple ClientConns connected to the same service but different backends? We do have advanced load balancing features in gRPC that could potentially help with this and avoid the need to keep your own cache.

GTB3NW commented 2 years ago

IDLE_TIMEOUT is not used for keeping connections alive, so this is a little confusing. Server-side, keepalive features can configure a MaxConnectionIdle setting to close connections to clients when they are not in use. This is somewhat similar to client-side IDLE_TIMEOUT, which is a feature that makes the client go completely dormant (close all connections including name resolver / load balancing, if applicable) when it is not in use.

When a server sends a GOAWAY, the client's subchannel connected to that server will transition to IDLE. If your client is using round_robin load balancing (not the default), it will just reconnect immediately. The default load balancer (pick_first) will not attempt to reconnect until an RPC is sent, or Connect (experimental) is called. Calling ClientConn.GetState() will return connectivity.Idle once this happens.

This actually sounds like exactly what I need. I did know about a GOAWAY causing the client to go to the idle state, but I didn't know under which circumstances that happens, so it's good to know which server setting controls it. I very much appreciate the insight there!

We do intend to implement IDLE_TIMEOUT eventually, but it hasn't been considered a high enough priority for the team to spend time on it so far.

Understandable! Hopefully others in a similar situation will find this discussion. I did find some other discussions, but they didn't cover this in as much depth or mention the future plans.

Can you explain more about what you're trying to do with your cache and how your deployment looks from a high level? Are you creating multiple ClientConns connected to the same service but different backends? We do have advanced load balancing features in gRPC that could potentially help with this and avoid the need to keep your own cache.

It's not a microservice architecture, but it is the same service implemented on many unrelated instances. We expect the load will come mostly all at once and then infrequently, if ever, afterwards. The idea behind the cache would be to keep the connection open for an endpoint and remove it from the cache after N minutes of inactivity, notification of shutdown, or connection loss. I'm not aware of any existing solutions, either in the first-party ecosystem or from third parties.
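A rough sketch of how such a cache might look (this is not the poster's actual implementation; the TTL, eviction interval, and insecure credentials are assumptions):

```go
// A sketch of a ClientConn cache keyed by address, evicted after an assumed
// idle TTL or once the connection has shut down.
package conncache

import (
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

type entry struct {
	cc       *grpc.ClientConn
	lastUsed time.Time
}

type Cache struct {
	mu      sync.Mutex
	conns   map[string]*entry
	idleTTL time.Duration
}

func New(idleTTL time.Duration) *Cache {
	c := &Cache{conns: make(map[string]*entry), idleTTL: idleTTL}
	go c.evictLoop()
	return c
}

// Get returns a cached connection for addr, dialing at most once per address.
func (c *Cache) Get(addr string) (*grpc.ClientConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.conns[addr]; ok {
		e.lastUsed = time.Now()
		return e.cc, nil
	}
	cc, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	c.conns[addr] = &entry{cc: cc, lastUsed: time.Now()}
	return cc, nil
}

// evictLoop periodically closes and drops stale or shut-down connections.
func (c *Cache) evictLoop() {
	for range time.Tick(time.Minute) {
		c.mu.Lock()
		for addr, e := range c.conns {
			state := e.cc.GetState()
			// Evict entries past the idle TTL or already shut down. Reacting
			// to connectivity.Idle (e.g. after a server GOAWAY, as discussed
			// above) could be added here as well.
			if time.Since(e.lastUsed) > c.idleTTL || state == connectivity.Shutdown {
				e.cc.Close()
				delete(c.conns, addr)
			}
		}
		c.mu.Unlock()
	}
}
```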

I make use of client-side load balancing for microservice-style deployments. Are there others I should know of that you could send documentation for?

dfawley commented 2 years ago

The idea behind the cache would be to keep the connection open for an endpoint and remove it from the cache after N minutes of inactivity, notification of shutdown, or connection loss. I'm not aware of any existing solutions, either in the first-party ecosystem or from third parties.

How do your servers expose their addresses for clients? E.g. are they all behind the same DNS name, do they each have their own, or do you not use DNS at all?

I make use of client-side load balancing for microservice-style deployments. Are there others I should know of that you could send documentation for?

We used to recommend gRPC-LB for more advanced scenarios like this, but that is now deprecated (or effectively deprecated) in favor of xDS. Are you familiar with Istio? I'm not sure if it has a feature quite like this, but it is possible.

If you are tolerant of using an experimental API, a custom client-side LB policy could definitely do things like this (https://pkg.go.dev/google.golang.org/grpc/balancer).
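For a sense of what that looks like, here is a minimal sketch of a custom policy built on the experimental balancer/base helpers; the policy name and the trivial "first ready subconn" pick logic are assumptions, not anything prescribed in this thread:

```go
// A minimal sketch of a custom LB policy using the balancer/base helpers.
// The policy name is hypothetical; the pick logic is deliberately trivial.
package custombalancer

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/balancer/base"
)

const Name = "my_custom_policy" // hypothetical policy name

func init() {
	balancer.Register(base.NewBalancerBuilder(Name, &pickerBuilder{}, base.Config{}))
}

type pickerBuilder struct{}

// Build is called whenever the set of READY subconns changes.
func (*pickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
	if len(info.ReadySCs) == 0 {
		return base.NewErrPicker(balancer.ErrNoSubConnAvailable)
	}
	var scs []balancer.SubConn
	for sc := range info.ReadySCs {
		scs = append(scs, sc)
	}
	return &picker{subConns: scs}
}

type picker struct {
	subConns []balancer.SubConn
}

// Pick chooses a subconn for each RPC; a real policy could inspect
// info.Ctx (e.g. request metadata) to route to a specific backend.
func (p *picker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	return balancer.PickResult{SubConn: p.subConns[0]}, nil
}
```

The policy would then be selected via the service config, e.g. by having the resolver (or `grpc.WithDefaultServiceConfig`) supply `{"loadBalancingConfig": [{"my_custom_policy":{}}]}`.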

GTB3NW commented 2 years ago

How do your servers expose their addresses for clients? E.g. are they all behind the same DNS name, do they each have their own, or do you not use DNS at all?

We don't use DNS in this case; each server has its own unique IPv4 address. I've actually published it over here - https://github.com/GalaxiteMC/grpcclientconncache - The API isn't exactly the most beautiful, but hopefully it demonstrates the purpose. Essentially the purpose is to connect to a known IP/port pair and try to do so only once. I'd still like to implement a client-based timeout to trigger the cleanup of connections, but as you can see from that repo, as long as the server implements a timeout there should be a timely cleanup of the cache; otherwise it will at least close connections on network failures and remote closures.

We used to recommend gRPC-LB for more advanced scenarios like this, but that is now deprecated (or effectively deprecated) in favor of xDS. Are you familiar with Istio? I'm not sure if it has a feature quite like this, but it is possible.

I am aware of Istio and xDS, yeah. I had actually considered xDS but brushed it off; we unfortunately have blockers on being able to deploy Istio. With that said, and purely theoretically: a single connection to Istio could route a request to a specific location, but is it extensible enough to make a routing decision via a third-party service? For example, given a UUID in the header, could it then route to the correct backend? I'm asking a lot here and probably going a little off topic, so I understand if you'd rather I ask this elsewhere.

If you are tolerant of using an experimental API, a custom client-side LB policy could definitely do things like this (https://pkg.go.dev/google.golang.org/grpc/balancer).

Super neat! I didn't know that existed. Would the ctx be the one passed into the RPC call? So we could theoretically pass the routing info in, and PickInfo would return the subconn to use? My only problem with this is that you'd need to have subconns for every possible endpoint, unless those can be spun up/down as needed; if that is the case, I could actually port the logic from the above project to this API.

I really appreciate the responses; they're super informative each time, so I want to say thanks for spending the time to respond :)

dfawley commented 2 years ago

We don't use DNS in this case; each server has its own unique IPv4 address. I've actually published it over here - GalaxiteMC/grpcclientconncache

How do you know these addresses, though? Are they hard-coded into the application or is there another discovery mechanism?

a single connection to Istio could route a request to a specific location, but is it extensible enough to make a routing decision via a third-party service? For example, given a UUID in the header, could it then route to the correct backend?

The connection to Istio is used to transfer the configuration data for the service. That configuration could say, e.g., "route requests matching this header (value/regex) to this backend". If you need custom logic or a custom remote load balancer, then you would need to implement your own LB policy.

Super neat! I didn't know that existed. Would the ctx be the one passed into the RPC call? So we could theoretically pass the routing info in, and PickInfo would return the subconn to use? My only problem with this is that you'd need to have subconns for every possible endpoint, unless those can be spun up/down as needed; if that is the case, I could actually port the logic from the above project to this API.

Yes, the LB policy (balancer.Balancer) has control of what subconns to create and how to assign them based on the addresses given to it by the name resolver (resolver.Resolver). The connectivity state changes of the subconns are pushed to it, so you could implement something that removes any subconn when it goes to Idle (or just leave it there and call Connect on it when it's needed in the future). These APIs are experimental, meaning we reserve the right to break API compatibility in a minor release. They don't change often, but we do have a few minor changes we know we want to make here in the next 6-12 months.
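To make the "routing info via the RPC context" idea concrete, a picker along these lines could read a hypothetical metadata key from info.Ctx and map it to a subconn maintained by the balancer; the key name and the map are assumptions:

```go
// A sketch of a metadata-driven picker: the "backend-id" key and the
// byBackend map are hypothetical, populated by the owning balancer.
package custombalancer

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/metadata"
)

type routingPicker struct {
	// byBackend maps a backend identifier to the subconn for that backend.
	byBackend map[string]balancer.SubConn
}

func (p *routingPicker) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	md, _ := metadata.FromOutgoingContext(info.Ctx)
	if ids := md.Get("backend-id"); len(ids) > 0 {
		if sc, ok := p.byBackend[ids[0]]; ok {
			return balancer.PickResult{SubConn: sc}, nil
		}
	}
	// No routing hint, or unknown backend: queue the RPC until a new picker
	// is produced (e.g. after the balancer creates the missing subconn).
	return balancer.PickResult{}, balancer.ErrNoSubConnAvailable
}
```

Callers would attach the hint with `metadata.AppendToOutgoingContext(ctx, "backend-id", ...)` before issuing the RPC.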

GTB3NW commented 2 years ago

How do you know these addresses, though? Are they hard-coded into the application or is there another discovery mechanism?

Another discovery mechanism. Think of a matchmaking system in a game which supplies an IP; we'd then connect to said IP for certain RPC calls. There's a strong likelihood that an instance will get called frequently in a short period of time during that matchmaking period, and then rarely afterwards, other than perhaps the odd call.

The connection to Istio is used to transfer the configuration data for the service. That configuration could say, e.g., "route requests matching this header (value/regex) to this backend". If you need custom logic or a custom remote load balancer, then you would need to implement your own LB policy.

Ah, so it's a push-style system? A service pushes routing config for itself to Istio; then does Istio forward that, or is the idea to maintain a connection to the Envoy part of Istio, which then does the routing?

Yes, the LB policy (balancer.Balancer) has control of what subconns to create and how to assign them based on the addresses given to it by the name resolver (resolver.Resolver). The connectivity state changes of the subconns are pushed to it, so you could implement something that removes any subconn when it goes to Idle (or just leave it there and call Connect on it when it's needed in the future).

I think I understand here, but I do have a few questions. Since there'd be no DNS in this case, we'd be connecting to N addresses which are unknown at creation time, so what would I dial? It sounds silly, but with no initial target it wouldn't make sense, unless I can supply addresses/names on the fly and it creates subconnections for me. Good to know I can integrate with the connectivity API; I think I'll still opt to remove subconns, since calling Connect would do essentially the same steps and I'd have a smaller chance of leaks over time. I've been reading this document, which seems like a good rundown of the API (https://www.sobyte.net/post/2022-03/golang-grpc/#3-customizing-the-client-balancer); do you have other reading suggestions? I've googled the package path and not found a whole lot.

These APIs are experimental, meaning we reserve the right to break API compatibility in a minor release. They don't change often, but we do have a few minor changes we know we want to make here in the next 6-12 months.

Understood! If I go down this route I'll try to keep it up to date and provide feedback too. Off the top of your head, would any of those changes affect this plan?

dfawley commented 2 years ago

Another discovery mechanism

OK, so in this kind of situation you'd probably want to make your own name resolver to supply the addresses to the LB policy, then: https://pkg.go.dev/google.golang.org/grpc@v1.48.0/resolver#Resolver. Typically the name resolver also produces the service config, which contains the LB policy's configuration.
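A minimal sketch of such a resolver might look like this; the scheme name and the hard-coded address list stand in for the real discovery mechanism:

```go
// A sketch of a custom name resolver. The scheme and the static address
// list are placeholder assumptions; a real implementation would query the
// matchmaking/discovery service in ResolveNow.
package mynameresolver

import (
	"google.golang.org/grpc/resolver"
)

const Scheme = "mynameresolver"

func init() {
	resolver.Register(&builder{})
}

type builder struct{}

func (*builder) Scheme() string { return Scheme }

func (*builder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	r := &res{cc: cc}
	r.ResolveNow(resolver.ResolveNowOptions{})
	return r, nil
}

type res struct {
	cc resolver.ClientConn
}

// ResolveNow pushes the current address list to gRPC; this is also where a
// service config selecting the LB policy could be supplied via UpdateState.
func (r *res) ResolveNow(resolver.ResolveNowOptions) {
	addrs := []resolver.Address{
		{Addr: "10.0.0.1:50051"}, // placeholder addresses
		{Addr: "10.0.0.2:50051"},
	}
	r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (*res) Close() {}
```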

Ah, so it's a push-style system? A service pushes routing config for itself to Istio; then does Istio forward that, or is the idea to maintain a connection to the Envoy part of Istio, which then does the routing?

The service owner sets the routing and load balancing configuration in Istio. Then the clients and servers connect to Istio to retrieve that configuration. You can use Envoy as a sidecar proxy, or use gRPC as a proxyless xDS implementation that supports the same protocol but doesn't require extra hops or managing the proxy.

I think I understand here, but I do have a few questions. Since there'd be no DNS in this case, we'd be connecting to N addresses which are unknown at creation time, so what would I dial?

The dial target is a URI containing the name resolver as the scheme and a name the resolver can use to look up the corresponding addresses: https://github.com/grpc/grpc/blob/master/doc/naming.md

So with DNS, your "normal dial target" is: dns:<name> or dns://<dns server>/<name>. Then the DNS name resolver looks up <name> and returns addresses for it to gRPC, which then forwards it to the configured LB policy (pick_first by default). (The DNS resolver is also the default resolver, which is why you can dial <name> alone; we assume the dns scheme if the target doesn't parse as a URI.)

So you'd make a custom resolver, register it, and then do: mynameresolver:[//<authority>/]<name>.

If you only have a single "name" that makes sense then you could just omit it and dial mynameresolver:. (I think that might work, but I'm actually not sure without testing/reading code.)
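Putting that together, dialing through the hypothetical custom scheme could look roughly like this (the module import path, target name, and credentials are placeholders):

```go
// A sketch of dialing with the custom scheme registered by the resolver
// sketch above; the import path and target name are hypothetical.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	_ "example.com/yourmodule/mynameresolver" // registers the resolver via init()
)

func main() {
	// Analogous to dns:///<name> for the default resolver, but handled by
	// the custom "mynameresolver" scheme instead.
	cc, err := grpc.Dial("mynameresolver:///whatever-name",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer cc.Close()
	// ... create stubs and issue RPCs ...
}
```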

These docs are extremely old so some APIs may have changed, but the concepts still apply:

https://github.com/grpc/proposal/blob/master/L9-go-resolver-balancer-API.md

Off the top of your head, would any of those changes affect this plan?

We're not looking to remove any functionality, just restructure the API a bit -- move methods around, consolidate resolver.ClientConn.UpdateState and ReportError into a single function, etc.

GTB3NW commented 2 years ago

@dfawley thank you so much, I think I have all I need to continue now. Happy to close this issue since the original sounds like it's being tracked elsewhere?

github-actions[bot] commented 2 years ago

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

dfawley commented 2 years ago

Sounds good; let us know if you have any further questions.