Open sorindumitru opened 8 months ago
Thanks for opening this @sorindumitru - would you mind sharing a little more about the motivation? Is it strictly TLS-related overhead? And if so, do you have a feel for what percentage of work is related to this versus other functions (e.g. signing SVIDs) ? A little more background will help a lot in figuring out a path forward
There's a bunch of things that I would like to be able to do, but can't currently.
xDS and likely other gRPC configuration options would help us to achieve this. With xDS you can also get DNS updates to happen whenever you want(at least if you implement them, the agent maintains a connection to the xDS server and ), so I think it’s also worth disabling the forced disconnect from the server side since the only point of it is to trigger a DNS resolution. As we have seen in https://github.com/spiffe/spire/issues/4696 that would help a lot. In that particular case it would go from 100k/180=~550 TLS handshakes per second (for each server instance) to 100k/(attestation period), which is usually bigger than 3 minutes.
I can try to get some data about the connection and TLS overhead, but I think what Ken saw in https://github.com/spiffe/spire/issues/4696 shows that there can be some big improvements.
Do the maintainers have any other concerns about this added configurability? Are they concerned that this would add more burden on their triaging efforts?
+1 I think the https://github.com/spiffe/spire/issues/4696 surfaced the right issue, but the suggested fix (of changing the default behavior) was too intrusive. Instead, (as this @sorindumitru suggests) exposing the load-balancing options as a configuration parameter would help customers with many agents. Note: this issue is the only reason we must maintain a fork.
To have a concrete example here of what we’d kind of configuration we’d need to have available:
dns://server:8081
or ipv4://10.0.1.1,10.0.1.2
or xds://spire-server:8081
xds
it would also need to enable using the service config specified by the resolver, otherwise it’s not very useable.Some example scenarios:
We've gained consensus on adding a new configurable in SPIRE Agent to allow passing an opaque payload that passed as the load_balancing_config
portion of the gRPC service config.
If the value is unset, the server will use the following (today's hard-coded default):
{ "loadBalancingConfig": [ { "round_robin": {} } ] }
If the value is set, the server will render and use the following:
{ "loadBalancingConfig": [ <configured value> ] }
The current spire-agent resolver and load balancer are configured to be DNS with round robin load balancing. gRPC implements that round robin as connecting to each address that the resolver presents and then each RPC is round robined across the connections. Additionally spire-server forces disconnects every 3 minutes to force clients to resolve DNS, in case there is a change in the list of servers. This works ok in most situations and provides a good behaviour out of the box, but it would be useful for some deployments to be able to configure this. For example:
This is somewhat similar to what was requested in https://github.com/spiffe/spire/issues/4696, but goes a bit further. It should also include being able to tune the server side of things. Currently it forces reconnections every 3 minute to force DNS resolution. This leads to spire-servers having to do periodic TLS handshakes with agents, even though there's nothing usually requiring it (e.g. DNS updates). With something like xDS it is no longer required since it can push updates to spire-agent, so it would be good to not have to do a bunch of TLS handshakes for nothing.