spire-{agent,server}: allow more gRPC connection configuration

sorindumitru commented 8 months ago

The current spire-agent resolver and load balancer are configured to be DNS with round robin load balancing. gRPC implements that round robin as connecting to each address that the resolver presents and then each RPC is round robined across the connections. Additionally spire-server forces disconnects every 3 minutes to force clients to resolve DNS, in case there is a change in the list of servers. This works ok in most situations and provides a good behaviour out of the box, but it would be useful for some deployments to be able to configure this. For example:

using an xDS resolver to have better control over resolution and being able to dynamically push updates. Can also be used to route a specific node to the right “downstream” spire-server group, based on the node id it sends in the xDS requests.
A pick_first load balancing option would allow similar configurations to the above, but with less flexibility to the administrator.

This is somewhat similar to what was requested in https://github.com/spiffe/spire/issues/4696, but goes a bit further. It should also include being able to tune the server side of things. Currently it forces reconnections every 3 minute to force DNS resolution. This leads to spire-servers having to do periodic TLS handshakes with agents, even though there's nothing usually requiring it (e.g. DNS updates). With something like xDS it is no longer required since it can push updates to spire-agent, so it would be good to not have to do a bunch of TLS handshakes for nothing.

evan2645 commented 8 months ago

Thanks for opening this @sorindumitru - would you mind sharing a little more about the motivation? Is it strictly TLS-related overhead? And if so, do you have a feel for what percentage of work is related to this versus other functions (e.g. signing SVIDs) ? A little more background will help a lot in figuring out a path forward

sorindumitru commented 8 months ago

There's a bunch of things that I would like to be able to do, but can't currently.

route agents to spire-server instances that are closer to it, in a multi-datacenter environment, while still allowing it to fallback to servers in a different datacenter if the ones in its own datacenter are unavailable.
automatic routing of agents to the right downstream server in a nested deployment (which has the portion of registration entries that the agent cares about).

xDS and likely other gRPC configuration options would help us to achieve this. With xDS you can also get DNS updates to happen whenever you want(at least if you implement them, the agent maintains a connection to the xDS server and ), so I think it’s also worth disabling the forced disconnect from the server side since the only point of it is to trigger a DNS resolution. As we have seen in https://github.com/spiffe/spire/issues/4696 that would help a lot. In that particular case it would go from 100k/180=~550 TLS handshakes per second (for each server instance) to 100k/(attestation period), which is usually bigger than 3 minutes.

I can try to get some data about the connection and TLS overhead, but I think what Ken saw in https://github.com/spiffe/spire/issues/4696 shows that there can be some big improvements.

Do the maintainers have any other concerns about this added configurability? Are they concerned that this would add more burden on their triaging efforts?

immutableT commented 8 months ago

+1 I think the https://github.com/spiffe/spire/issues/4696 surfaced the right issue, but the suggested fix (of changing the default behavior) was too intrusive. Instead, (as this @sorindumitru suggests) exposing the load-balancing options as a configuration parameter would help customers with many agents. Note: this issue is the only reason we must maintain a fork.

sorindumitru commented 7 months ago

To have a concrete example here of what we’d kind of configuration we’d need to have available:

server_endpoint - you can tell it something like dns://server:8081 or ipv4://10.0.1.1,10.0.1.2 or xds://spire-server:8081
- In the case of xds it would also need to enable using the service config specified by the resolver, otherwise it’s not very useable.
default_service_config -to specify the service configuration or if this opens too much (e.g. timeouts and retries) it could only specify the load balancing configuration

Some example scenarios:

using xds://spire-serve:8081 to use xDS for name resolution. You can use the XDS_BOOTSTRAP_CONFIG to send some information to the xDS server about the node, for example to know which
use default_service_config to specify that you want pick_first as the load balancing option, as other users have mentioned.
Combine both e.g. you don’t have an easy way to give SPIRE a DNS name and you specify ipv4://10.0.1.1,...,10.0.1.32 and pick_first as the load balancing option

azdagron commented 7 months ago

https://github.com/grpc/grpc/blob/master/doc/service_config.md

azdagron commented 6 months ago

We've gained consensus on adding a new configurable in SPIRE Agent to allow passing an opaque payload that passed as the load_balancing_config portion of the gRPC service config.

If the value is unset, the server will use the following (today's hard-coded default):

{ "loadBalancingConfig": [ { "round_robin": {} } ] }

If the value is set, the server will render and use the following:

{ "loadBalancingConfig": [ <configured value> ] }

spiffe / spire

spire-{agent,server}: allow more gRPC connection configuration #4990