spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0
1.82k stars 478 forks source link

Agent GRPC configuration causes SRV DNS lookups which can fail #5563

Open keeganwitt opened 1 month ago

keeganwitt commented 1 month ago

Using grpc.WithDefaultServiceConfig(roundRobinServiceConfig) results in SRV DNS lookups. I believe this is because GRPC's code here will attempt to populate addresses to load balance between with both the SRV and A records if EnableSRVLookups is true, which it will be because the grpclb package is initialized.

However, when using an external load balancer (such as aws-load-balancer) and external DNS so that the agents collocated in the same pod as your downstream server can access the upstream server, it should not be using SRV records, but should instead be using A records, as those are the type AWS will create. This results in failed DNS lookups and excessive load on your DNS system. If you generate enough of these NXDOMAIN queries, there can be a significant expense in Route53.

The usage of this load balancer was introduced in #1061.

keeganwitt commented 1 month ago

Actually, in the case of Kubernetes, even for the agents in a daemonset communicating to the downstream server, A/AAA records will be typical rather than SRV records (see here).

I'm thinking the fix for this would be to add an option to the agent config to turn on/off the gRPC load balancing.

keeganwitt commented 1 month ago

https://github.com/spiffe/spire/blob/3b4de184dd308212014da3f68c8e8cbecb794b77/pkg/agent/attestor/node/node.go#L299 and https://github.com/spiffe/spire/blob/3b4de184dd308212014da3f68c8e8cbecb794b77/pkg/agent/client/dial.go#L67 were the two places that cause this.

sorindumitru commented 1 month ago

@keeganwitt There's #4990 to make that configuration more configurable. We've agreed on allowing more options than the default, maybe another option would be no configuration.

amartinezfayo commented 1 month ago

Thank you @keeganwitt for opening this, and thank you @sorindumitru for pointing to the issue that this depends on. This depends on #4990 to be able to fix it.