Flaky test: Test/AggregateCluster_BadDNS_GoodEDS

easwars commented 3 months ago

--- FAIL: Test (12.63s)
    --- FAIL: Test/AggregateCluster_BadDNS_GoodEDS (5.01s)
        aggregate_cluster_test.go:844: Created new snapshot cache...
        tlogger.go:116: INFO server.go:685 [core] [Server #400]Server created  (t=+349.043µs)
        aggregate_cluster_test.go:844: Registered Aggregated Discovery Service (ADS)...
        aggregate_cluster_test.go:844: xDS management server serving at: 127.0.0.1:42115...
        tlogger.go:116: INFO server.go:881 [core] [Server #400 ListenSocket #401]ListenSocket created  (t=+554.614µs)
        tlogger.go:116: INFO server.go:685 [core] [Server #402]Server created  (t=+1.708125ms)
        stubserver.go:265: Started test service backend at "127.0.0.1:44577"
        server.go:229: Created new resource snapshot...
        server.go:235: Updated snapshot cache with resource snapshot...
        tlogger.go:116: INFO bootstrap.go:569 [xds] [xds-bootstrap] Bootstrap config for creating xds-client: &{xDSServers:[0xb074820] cpcs:map[client-side-certificate-provider-instance:{PluginName:file_watcher Config:[123 10 34 99 101 114 116 105 102 105 99 97 116 101 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 67 108 105 101 110 116 83 105 100 101 88 68 83 52 49 54 56 49 51 55 50 48 47 99 101 114 116 46 112 101 109 34 44 10 34 112 114 105 118 97 116 101 95 107 101 121 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 67 108 105 101 110 116 83 105 100 101 88 68 83 52 49 54 56 49 51 55 50 48 47 107 101 121 46 112 101 109 34 44 10 34 99 97 95 99 101 114 116 105 102 105 99 97 116 101 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 67 108 105 101 110 116 83 105 100 101 88 68 83 52 49 54 56 49 51 55 50 48 47 99 97 46 112 101 109 34 44 10 34 114 101 102 114 101 115 104 95 105 110 116 101 114 118 97 108 34 58 32 34 54 48 48 115 34 10 125]} server-side-certificate-provider-instance:{PluginName:file_watcher Config:[123 10 34 99 101 114 116 105 102 105 99 97 116 101 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 83 101 114 118 101 114 83 105 100 101 88 68 83 50 55 48 50 52 51 54 48 54 55 47 99 101 114 116 46 112 101 109 34 44 10 34 112 114 105 118 97 116 101 95 107 101 121 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 83 101 114 118 101 114 83 105 100 101 88 68 83 50 55 48 50 52 51 54 48 54 55 47 107 101 121 46 112 101 109 34 44 10 34 99 97 95 99 101 114 116 105 102 105 99 97 116 101 95 102 105 108 101 34 58 32 34 47 116 109 112 47 116 101 115 116 83 101 114 118 101 114 83 105 100 101 88 68 83 50 55 48 50 52 51 54 48 54 55 47 99 97 46 112 101 109 34 44 10 34 114 101 102 114 101 115 104 95 105 110 116 101 114 118 97 108 34 58 32 34 54 48 48 115 34 10 125]}] serverListenerResourceNameTemplate:grpc/server?xds.resource.listening_address=%s clientDefaultListenerResourceNameTemplate:%s 
authorities:map[] node:{ID:e407ab47-3e4d-4bf9-88f2-efbd30f041db Cluster: Locality:{Region: Zone: SubZone:} Metadata:<nil> userAgentName:gRPC Go userAgentVersionType:{UserAgentVersion:1.66.0-dev} clientFeatures:[envoy.lb.does_not_support_overprovisioning xds.config.resource-in-sotw]} certProviderConfigs:map[client-side-certificate-provider-instance:0xb0eaa00 server-side-certificate-provider-instance:0xb0eaa60]}  (t=+1.98356ms)
        tlogger.go:116: INFO client_new.go:87 [xds] [xds-client 0xaf9a870] Created client to xDS management server: passthrough:///127.0.0.1:42115-insecure-  (t=+2.054252ms)
        tlogger.go:116: INFO clientconn.go:1687 [core] original dial target is: "whatever:///test.service"  (t=+2.107552ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #403]Channel created  (t=+2.133601ms)
        tlogger.go:116: INFO clientconn.go:191 [core] [Channel #403]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"whatever", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/test.service", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+2.161934ms)
        tlogger.go:116: INFO clientconn.go:192 [core] [Channel #403]Channel authority set to "test.service"  (t=+2.182863ms)
        tlogger.go:116: INFO server.go:881 [core] [Server #402 ListenSocket #404]ListenSocket created  (t=+2.260999ms)
        tlogger.go:116: INFO resolver_wrapper.go:197 [core] [Channel #403]Resolver state updated: {
              "Addresses": null,
              "Endpoints": [],
              "ServiceConfig": {
                "Config": {
                  "Config": null,
                  "Methods": {}
                },
                "Err": null
              },
              "Attributes": {
                "\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0xaffc668\u003e"
              }
            } (service config updated)  (t=+2.337182ms)
        tlogger.go:116: INFO balancer_wrapper.go:103 [core] [Channel #403]Channel switches to new LB policy "cds_experimental"  (t=+2.386634ms)
        tlogger.go:116: INFO cdsbalancer.go:109 [xds] [cds-lb 0xac02008] Created  (t=+2.41107ms)
        tlogger.go:116: INFO cdsbalancer.go:121 [xds] [cds-lb 0xac02008] xDS credentials in use: false  (t=+2.428001ms)
        tlogger.go:116: INFO cdsbalancer.go:290 [xds] [cds-lb 0xac02008] Received balancer config update: {
              "LoadBalancingConfig": null,
              "Cluster": "cluster-my-service-client-side-xds"
            }  (t=+2.458438ms)
        tlogger.go:116: INFO clientconn.go:1687 [core] original dial target is: "passthrough:///127.0.0.1:42115"  (t=+2.502781ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #405]Channel created  (t=+2.54506ms)
        tlogger.go:116: INFO clientconn.go:191 [core] [Channel #405]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/127.0.0.1:42115", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+2.5717ms)
        tlogger.go:116: INFO clientconn.go:192 [core] [Channel #405]Channel authority set to "127.0.0.1:42115"  (t=+2.591397ms)
        tlogger.go:116: INFO resolver_wrapper.go:197 [core] [Channel #405]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:42115",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:42115",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+2.649836ms)
        tlogger.go:116: INFO balancer_wrapper.go:103 [core] [Channel #405]Channel switches to new LB policy "pick_first"  (t=+2.685744ms)
        tlogger.go:116: INFO clientconn.go:852 [core] [Channel #405 SubChannel #406]Subchannel created  (t=+2.712143ms)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #405]Channel Connectivity change to CONNECTING  (t=+2.733874ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #405]Channel exiting idle mode  (t=+2.754623ms)
        tlogger.go:116: INFO transport.go:238 [xds] [xds-client 0xaf9a870] [passthrough:///127.0.0.1:42115] Created transport to server "passthrough:///127.0.0.1:42115"  (t=+2.779068ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #403]Channel exiting idle mode  (t=+2.808233ms)
        tlogger.go:116: INFO clientconn.go:1213 [core] [Channel #405 SubChannel #406]Subchannel Connectivity change to CONNECTING  (t=+2.858847ms)
        tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #405 SubChannel #406]Subchannel picks a new address "127.0.0.1:42115" to connect  (t=+2.877192ms)
        tlogger.go:116: INFO clientconn.go:1213 [core] [Channel #405 SubChannel #406]Subchannel Connectivity change to READY  (t=+3.823602ms)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #405]Channel Connectivity change to READY  (t=+3.84935ms)
        tlogger.go:116: INFO transport.go:337 [xds] [xds-client 0xaf9a870] [passthrough:///127.0.0.1:42115] ADS stream created  (t=+3.918126ms)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.cluster.v3.Cluster" ["cluster-my-service-client-side-xds"] map[] ["cluster-my-service-client-side-xds"]] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: respond [type.googleapis.com/envoy.config.cluster.v3.Cluster [cluster-my-service-client-side-xds]  1]%!v(MISSING) version %!q(MISSING) with version %!q(MISSING)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.cluster.v3.Cluster" ["cluster-my-service-client-side-xds"] map["cluster-my-service-client-side-xds":{}] []] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: open watch [1 %!d(string=type.googleapis.com/envoy.config.cluster.v3.Cluster) [%!d(string=cluster-my-service-client-side-xds)] %!d(string=e407ab47-3e4d-4bf9-88f2-efbd30f041db) %!d(string=1)] for %!s(MISSING)%!v(MISSING) from nodeID %!q(MISSING), version %!q(MISSING)
        tlogger.go:116: INFO cdsbalancer.go:414 [xds] [cds-lb 0xac02008] Received Cluster resource: {
              "ClusterType": 2,
              "ClusterName": "cluster-my-service-client-side-xds",
              "EDSServiceName": "",
              "LRSServerConfig": null,
              "SecurityCfg": null,
              "MaxRequests": null,
              "DNSHostName": "",
              "PrioritizedClusterNames": [
                "cluster-my-service-client-side-xds-dns",
                "cluster-my-service-client-side-xds-eds"
              ],
              "LBPolicy": [
                {
                  "xds_wrr_locality_experimental": {
                    "childPolicy": [
                      {
                        "round_robin": {}
                      }
                    ]
                  }
                }
              ],
              "OutlierDetection": null,
              "Raw": {
                "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "value": "CiJjbHVzdGVyLW15LXNlcnZpY2UtY2xpZW50LXNpZGUteGRzsgK5AQoYZW52b3kuY2x1c3RlcnMuYWdncmVnYXRlEpwBCkh0eXBlLmdvb2dsZWFwaXMuY29tL2Vudm95LmV4dGVuc2lvbnMuY2x1c3RlcnMuYWdncmVnYXRlLnYzLkNsdXN0ZXJDb25maWcSUAomY2x1c3Rlci1teS1zZXJ2aWNlLWNsaWVudC1zaWRlLXhkcy1kbnMKJmNsdXN0ZXItbXktc2VydmljZS1jbGllbnQtc2lkZS14ZHMtZWRz"
              },
              "TelemetryLabels": {
                "csm.service_name": "unknown",
                "csm.service_namespace_name": "unknown"
              }
            }  (t=+4.656547ms)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.cluster.v3.Cluster" ["cluster-my-service-client-side-xds" "cluster-my-service-client-side-xds-dns"] map["cluster-my-service-client-side-xds":{}] ["cluster-my-service-client-side-xds-dns"]] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: respond [type.googleapis.com/envoy.config.cluster.v3.Cluster [cluster-my-service-client-side-xds cluster-my-service-client-side-xds-dns] 1 1]%!v(MISSING) version %!q(MISSING) with version %!q(MISSING)
        tlogger.go:116: INFO cdsbalancer.go:414 [xds] [cds-lb 0xac02008] Received Cluster resource: {
              "ClusterType": 1,
              "ClusterName": "cluster-my-service-client-side-xds-dns",
              "EDSServiceName": "",
              "LRSServerConfig": null,
              "SecurityCfg": null,
              "MaxRequests": null,
              "DNSHostName": "bad.ip.v4.address:8080",
              "PrioritizedClusterNames": null,
              "LBPolicy": [
                {
                  "xds_wrr_locality_experimental": {
                    "childPolicy": [
                      {
                        "round_robin": {}
                      }
                    ]
                  }
                }
              ],
              "OutlierDetection": null,
              "Raw": {
                "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "value": "CiZjbHVzdGVyLW15LXNlcnZpY2UtY2xpZW50LXNpZGUteGRzLWRuc4oCIBIeEhwKGgoYChYSEWJhZC5pcC52NC5hZGRyZXNzGJA/EAI="
              },
              "TelemetryLabels": {
                "csm.service_name": "unknown",
                "csm.service_namespace_name": "unknown"
              }
            }  (t=+5.072148ms)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.cluster.v3.Cluster" ["cluster-my-service-client-side-xds-dns" "cluster-my-service-client-side-xds-eds" "cluster-my-service-client-side-xds"] map["cluster-my-service-client-side-xds":{} "cluster-my-service-client-side-xds-dns":{}] ["cluster-my-service-client-side-xds-eds"]] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: respond [type.googleapis.com/envoy.config.cluster.v3.Cluster [cluster-my-service-client-side-xds-dns cluster-my-service-client-side-xds-eds cluster-my-service-client-side-xds] 1 1]%!v(MISSING) version %!q(MISSING) with version %!q(MISSING)
        tlogger.go:116: INFO cdsbalancer.go:414 [xds] [cds-lb 0xac02008] Received Cluster resource: {
              "ClusterType": 0,
              "ClusterName": "cluster-my-service-client-side-xds-eds",
              "EDSServiceName": "endpoints-my-service-client-side-xds",
              "LRSServerConfig": null,
              "SecurityCfg": null,
              "MaxRequests": null,
              "DNSHostName": "",
              "PrioritizedClusterNames": null,
              "LBPolicy": [
                {
                  "xds_wrr_locality_experimental": {
                    "childPolicy": [
                      {
                        "round_robin": {}
                      }
                    ]
                  }
                }
              ],
              "OutlierDetection": null,
              "Raw": {
                "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "value": "CiZjbHVzdGVyLW15LXNlcnZpY2UtY2xpZW50LXNpZGUteGRzLWVkcxoqCgIaABIkZW5kcG9pbnRzLW15LXNlcnZpY2UtY2xpZW50LXNpZGUteGRzEAM="
              },
              "TelemetryLabels": {
                "csm.service_name": "unknown",
                "csm.service_namespace_name": "unknown"
              }
            }  (t=+5.371728ms)
        tlogger.go:116: INFO clusterresolver.go:85 [xds] [xds-cluster-resolver-lb 0xacedce8] Created  (t=+5.409318ms)
        tlogger.go:116: INFO cdsbalancer.go:455 [xds] [cds-lb 0xac02008] Created child policy 0xacedce8 of type cluster_resolver_experimental  (t=+5.430217ms)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.cluster.v3.Cluster" ["cluster-my-service-client-side-xds-eds" "cluster-my-service-client-side-xds" "cluster-my-service-client-side-xds-dns"] map["cluster-my-service-client-side-xds":{} "cluster-my-service-client-side-xds-dns":{} "cluster-my-service-client-side-xds-eds":{}] []] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        tlogger.go:116: INFO clusterresolver.go:187 [xds] [xds-cluster-resolver-lb 0xacedce8] Received new balancer config: {
              "discoveryMechanisms": [
                {
                  "cluster": "cluster-my-service-client-side-xds-dns",
                  "type": "LOGICAL_DNS",
                  "dnsHostname": "bad.ip.v4.address:8080",
                  "outlierDetection": {},
                  "telemetryLabels": {
                    "csm.service_name": "unknown",
                    "csm.service_namespace_name": "unknown"
                  }
                },
                {
                  "cluster": "cluster-my-service-client-side-xds-eds",
                  "edsServiceName": "endpoints-my-service-client-side-xds",
                  "outlierDetection": {},
                  "telemetryLabels": {
                    "csm.service_name": "unknown",
                    "csm.service_namespace_name": "unknown"
                  }
                }
              ],
              "xdsLbPolicy": [
                {
                  "xds_wrr_locality_experimental": {
                    "childPolicy": [
                      {
                        "round_robin": {}
                      }
                    ]
                  }
                }
              ]
            }  (t=+5.537067ms)
        logging.go:30: open watch [2 %!d(string=type.googleapis.com/envoy.config.cluster.v3.Cluster) [%!d(string=cluster-my-service-client-side-xds-eds) %!d(string=cluster-my-service-client-side-xds) %!d(string=cluster-my-service-client-side-xds-dns)] %!d(string=e407ab47-3e4d-4bf9-88f2-efbd30f041db) %!d(string=1)] for %!s(MISSING)%!v(MISSING) from nodeID %!q(MISSING), version %!q(MISSING)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment" ["endpoints-my-service-client-side-xds"] map[] ["endpoints-my-service-client-side-xds"]] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: respond [type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment [endpoints-my-service-client-side-xds]  1]%!v(MISSING) version %!q(MISSING) with version %!q(MISSING)
        logging.go:30: nodeID ["e407ab47-3e4d-4bf9-88f2-efbd30f041db" "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment" ["endpoints-my-service-client-side-xds"] map["endpoints-my-service-client-side-xds":{}] []] requested %!s(MISSING)%!v(MISSING) and known %!v(MISSING). Diff %!v(MISSING)
        logging.go:30: open watch [3 %!d(string=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment) [%!d(string=endpoints-my-service-client-side-xds)] %!d(string=e407ab47-3e4d-4bf9-88f2-efbd30f041db) %!d(string=1)] for %!s(MISSING)%!v(MISSING) from nodeID %!q(MISSING), version %!q(MISSING)
        aggregate_cluster_test.go:887: EmptyCall() failed: rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded
        tlogger.go:116: WARNING authority.go:420 [xds] [xds-client 0xaf9a870] [passthrough:///127.0.0.1:42115] Watchers not notified since ADS stream failed after having received at least one response: rpc error: code = Canceled desc = context canceled  (t=+5.002196321s)
        tlogger.go:116: WARNING transport.go:479 [xds] [xds-client 0xaf9a870] [passthrough:///127.0.0.1:42115] ADS stream closed: rpc error: code = Canceled desc = context canceled  (t=+5.002221188s)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #405]Channel Connectivity change to SHUTDOWN  (t=+5.002271913s)
        tlogger.go:116: INFO resolver_wrapper.go:100 [core] [Channel #405]Closing the name resolver  (t=+5.002300566s)
        tlogger.go:116: INFO balancer_wrapper.go:135 [core] [Channel #405]ccBalancerWrapper: closing  (t=+5.00232392s)
        tlogger.go:116: INFO clientconn.go:1213 [core] [Channel #405 SubChannel #406]Subchannel Connectivity change to SHUTDOWN  (t=+5.002354617s)
        tlogger.go:116: INFO clientconn.go:1560 [core] [Channel #405 SubChannel #406]Subchannel deleted  (t=+5.002375276s)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #405]Channel deleted  (t=+5.00253782s)
        tlogger.go:116: INFO clientimpl.go:100 [xds] [xds-client 0xaf9a870] Shutdown  (t=+5.002565051s)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #403]Channel Connectivity change to SHUTDOWN  (t=+5.002591941s)
        tlogger.go:116: INFO resolver_wrapper.go:100 [core] [Channel #403]Closing the name resolver  (t=+5.00260756s)
        tlogger.go:116: INFO balancer_wrapper.go:135 [core] [Channel #403]ccBalancerWrapper: closing  (t=+5.002625464s)
        tlogger.go:116: INFO dns_resolver.go:282 [dns] dns: A record lookup error: lookup bad.ip.v4.address on 127.0.0.53:53: dial udp 127.0.0.53:53: operation was canceled  (t=+5.006316226s)
        tlogger.go:116: INFO clusterresolver.go:336 [xds] [xds-cluster-resolver-lb 0xacedce8] Shutdown  (t=+5.006368183s)
        tlogger.go:116: INFO cdsbalancer.go:380 [xds] [cds-lb 0xac02008] Shutdown  (t=+5.006384544s)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #403]Channel deleted  (t=+5.006416885s)
        tlogger.go:116: INFO server.go:817 [core] [Server #402 ListenSocket #404]ListenSocket deleted  (t=+5.006470555s)
        tlogger.go:116: INFO server.go:817 [core] [Server #400 ListenSocket #401]ListenSocket deleted  (t=+5.006518535s)
FAIL
FAIL    google.golang.org/grpc/xds/internal/balancer/clusterresolver/e2e_test   22.379s

https://github.com/grpc/grpc-go/actions/runs/9623935969/job/26546996260?pr=7342

purnesh42H commented 3 months ago

No failures in 100 re-runs.
No failures in 10000 re-runs.

@easwars FYI. Let me know if we should still keep this issue open.

easwars commented 2 months ago

https://github.com/grpc/grpc-go/actions/runs/9912020526/job/27385944817?pr=7408

arjan-bal commented 2 months ago

https://github.com/grpc/grpc-go/actions/runs/9921494803/job/27409525025?pr=7411

easwars commented 1 month ago

https://github.com/grpc/grpc-go/actions/runs/10206168575/job/28238482724?pr=7476

zasweq commented 1 month ago

https://github.com/grpc/grpc-go/actions/runs/10271395115/job/28421255501

purnesh42H commented 1 month ago

https://github.com/grpc/grpc-go/actions/runs/10477120740/job/29017521670?pr=7468

arjan-bal commented 1 month ago

I briefly had a look at this. In failures, the logs stop just before the cluster_resolver balancer creates child priority balancers. Before creating the balancers, cluster_resolver waits for both resolution mechanisms (DNS and EDS) to report a (possibly empty) list of endpoints: https://github.com/grpc/grpc-go/blob/cfd14baa8264cbeebf6308a7b68333c8c2fc6e86/xds/internal/balancer/clusterresolver/resource_resolver.go#L284-L312

I suspect that either DNS or EDS doesn't resolve the service endpoints within the 5-second deadline.

DNS tries to resolve bad.ip.v4.address, a syntactically valid hostname that results in NXDOMAIN because it has no public DNS record. This sends an actual DNS request, which fails. I suspect that this lookup could be taking more than 5 seconds during the failures. We could change the hostname to an invalid one (e.g. bad%ip%v4%address) so that no DNS request is sent at all. Locally, this resolution took around 100 ms to complete. I can't say for sure whether this is the cause of the failures.
EDS registers a watch for Endpoints with the xDS client. From the failure logs, we can see that the watch is being registered. I suspected there could be a deadlock (from blocking an event channel or not calling onDone to ack the updates) that causes the watch to never receive the endpoint list, but after going through the code I couldn't find anything suspicious.

arjan-bal commented 3 weeks ago

There are other tests in the same file that still use real DNS. I saw Test/AggregateCluster_BadEDS_BadDNS flake with a similar timeout. We need to mock DNS in the remaining tests too (similar to https://github.com/grpc/grpc-go/pull/7561).

https://github.com/grpc/grpc-go/actions/runs/10780783629/job/29897276983?pr=7498

arjan-bal commented 1 week ago

Another failure for Test/AggregateCluster_BadEDS_BadDNS: https://github.com/grpc/grpc-go/actions/runs/11019297069/job/30601528257

I'll try to raise a PR with the fix.

Issue: grpc/grpc-go #7354