easwars closed this 1 week ago
@easwars FYI. Let me know if we should still keep it reopened.
I briefly had a look at this. In failures, the logs stop just before the cluster_resolver balancer creates child priority balancers. Before creating the balancers, cluster_resolver waits for both resolution mechanisms (DNS and EDS) to report a (possibly empty) list of endpoints: https://github.com/grpc/grpc-go/blob/cfd14baa8264cbeebf6308a7b68333c8c2fc6e86/xds/internal/balancer/clusterresolver/resource_resolver.go#L284-L312
I suspect that either DNS or EDS doesn't resolve the service endpoints within the 5 sec deadline.
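To make the suspected failure mode concrete, here is a minimal standard-library sketch of "wait for every mechanism to report, or give up at a deadline". It is not the cluster_resolver code linked above; `update`, `waitForAll`, and `simulate` are hypothetical names, and the delays and endpoint address are made up for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// update is what a (hypothetical) resolution mechanism reports: its name and
// a possibly empty endpoint list.
type update struct {
	mechanism string
	endpoints []string
}

// waitForAll blocks until every named mechanism has reported at least once,
// or until ctx expires. It returns the merged endpoint list.
func waitForAll(ctx context.Context, mechanisms []string, updates <-chan update) ([]string, error) {
	pending := make(map[string]bool, len(mechanisms))
	for _, m := range mechanisms {
		pending[m] = true
	}
	var all []string
	for len(pending) > 0 {
		select {
		case u := <-updates:
			delete(pending, u.mechanism) // an empty list still counts as a report
			all = append(all, u.endpoints...)
		case <-ctx.Done():
			return nil, fmt.Errorf("still waiting on %d mechanism(s): %w", len(pending), ctx.Err())
		}
	}
	return all, nil
}

// simulate starts a fake DNS and a fake EDS mechanism that report after the
// given delays (in milliseconds) and waits for both under a deadline.
// It returns the number of endpoints received.
func simulate(dnsDelayMs, edsDelayMs, deadlineMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(deadlineMs)*time.Millisecond)
	defer cancel()

	updates := make(chan update, 2) // buffered so late senders never block
	report := func(name string, delayMs int, eps []string) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		updates <- update{mechanism: name, endpoints: eps}
	}
	go report("DNS", dnsDelayMs, nil) // bad hostname: empty list
	go report("EDS", edsDelayMs, []string{"10.0.0.1:443"})

	eps, err := waitForAll(ctx, []string{"DNS", "EDS"}, updates)
	return len(eps), err
}

func main() {
	// Both mechanisms report well inside the deadline.
	n, err := simulate(10, 20, 500)
	fmt.Println(n, err)

	// One mechanism is slower than the deadline, so nothing is produced.
	_, err = simulate(10, 2000, 100)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

The point of the sketch is that a single slow mechanism is enough to stall the whole step, even when the other one reported promptly.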
The test uses the hostname bad.ip.v4.address, which results in NXDOMAIN since it doesn't have a public DNS record. This makes an actual DNS request, which fails. I suspected that this lookup could be taking more than 5 secs during the failures. We could change this hostname to an invalid URL (e.g. bad%ip%v4%address) so that a DNS request is not sent at all. Locally this resolution took around 100ms to complete. I can't say for sure if this is the cause of the failures.
onDone to ack the updates) causing the watch to never receive the endpoint list, but after going through the code I couldn't find anything suspicious.
There are other tests in the same file that still use real DNS. I saw Test/AggregateCluster_BadEDS_BadDNS flake with a similar timeout. We need to mock DNS in the remaining tests too (similar to https://github.com/grpc/grpc-go/pull/7561).
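For readers unfamiliar with the idea, mocking DNS here just means the test never sends a real query. The sketch below shows that in plain Go with an injectable lookup function; it is only an illustration and not the mechanism used in the linked PR. `lookupHostFunc`, `fakeNXDomain`, and `resolve` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// lookupHostFunc matches the signature of (*net.Resolver).LookupHost, so the
// real resolver and a fake can be swapped freely.
type lookupHostFunc func(ctx context.Context, host string) ([]string, error)

// realLookup sends an actual DNS query; its latency depends on the network.
var realLookup lookupHostFunc = net.DefaultResolver.LookupHost

// fakeNXDomain answers immediately with a "not found" error and never
// touches the network.
func fakeNXDomain(_ context.Context, host string) ([]string, error) {
	return nil, &net.DNSError{Err: "no such host", Name: host, IsNotFound: true}
}

// resolve runs lookup under a deadline and reports whether the host was
// definitively not found, as opposed to the lookup timing out.
func resolve(lookup lookupHostFunc, host string, timeoutMs int) (addrs []string, notFound bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	addrs, err = lookup(ctx, host)
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		notFound = dnsErr.IsNotFound
	}
	return addrs, notFound, err
}

func main() {
	// In a test, inject the fake so the result does not depend on how long
	// a real NXDOMAIN answer takes to arrive.
	addrs, notFound, err := resolve(fakeNXDomain, "bad.ip.v4.address", 5000)
	fmt.Println(addrs, notFound, err)
}
```

Passing `realLookup` instead of `fakeNXDomain` would exercise the real resolver, which is the path that can exceed the deadline on a slow CI network.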
https://github.com/grpc/grpc-go/actions/runs/10780783629/job/29897276983?pr=7498
Another failure for Test/AggregateCluster_BadEDS_BadDNS: https://github.com/grpc/grpc-go/actions/runs/11019297069/job/30601528257
I'll try to raise a PR with the fix.
https://github.com/grpc/grpc-go/actions/runs/9623935969/job/26546996260?pr=7342