Closed: rcrowe closed this issue 1 month ago.
Before each of the error logs, the listener resource for iam-machine-api.auth isn't found (maybe the servers restarted?). This causes an empty resolver state to be created, which results in the creation of a pickfirst leaf LB policy, the default LB policy when none is specified. pickfirst is given the empty endpoint list and logs the error we're seeing.
2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] Received resource-not-found-error for Listener resource "iam-machine-api.auth:5000"
2024/09/12 07:38:25 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1789713",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538097"
}
2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] For Listener resource "iam-machine-api.auth:5000" and RouteConfiguration resource "", generated service config: {
"loadBalancingConfig": [
{
"xds_cluster_manager_experimental": {
"children": {
"cluster:iam-machine-api.auth.5000": {
"childPolicy": [
{
"cds_experimental": {
"cluster": "iam-machine-api.auth.5000"
}
}
]
}
}
}
}
]
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
"Addresses": null,
"Endpoints": [],
"ServiceConfig": {
"Config": {
"Config": null,
"Methods": {}
},
"Err": null
},
"Attributes": {
"\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x0\u003e",
"\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0xc007d369b0\u003e"
}
} ()
2024/09/12 07:38:25 INFO: [xds] [xds-cluster-manager-lb 0xc02deeda00] Update with config {
"LoadBalancingConfig": null,
"Children": {
"cluster:iam-machine-api.auth.5000": {
"ChildPolicy": [
{
"cds_experimental": {
"LoadBalancingConfig": null,
"Cluster": "iam-machine-api.auth.5000"
}
}
]
}
}
}, resolver state {Addresses:[] Endpoints:[] ServiceConfig:0xc02fff8580 Attributes:{"<%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)>": "<0x0>" , "<%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)>": "<0xc007d369b0>" }}
2024/09/12 07:38:25 INFO: [xds] [cds-lb 0xc02f3cc9c0] Received balancer config update: {
"LoadBalancingConfig": null,
"Cluster": "iam-machine-api.auth.5000"
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
"Addresses": null,
"Endpoints": [],
"ServiceConfig": {
"Config": {
"Config": null,
"Methods": {}
},
"Err": null
},
"Attributes": null
} (service config updated)
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel switches to new LB policy "pick_first"
2024/09/12 07:38:25 INFO: [pick-first-lb] [pick-first-lb 0xc02fffbda0] Received error from the name resolver: produced zero addresses
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel Connectivity change to TRANSIENT_FAILURE
2024/09/12 07:38:25 INFO: [core] error from balancer.UpdateClientConnState: bad resolver state
Once the listener becomes available, pickfirst is removed. I'm still looking into whether anything changed in 1.66 about the handling of not-found listeners that would cause these logs to show up.
Thanks @arjan-bal
The logs were captured from a service that would have been hitting a few services configured via xDS; however, during our investigation we focused only on iam-policy-decision-api.auth, so we can only say with certainty that this endpoint was exhibiting the issue. It could well have been the same with the machine endpoint.
I'll take that away & try to dig into why the listener resource is missing, as I can see from the logs that this is happening across both endpoints discussed here. As best I can tell, none of the servers have been restarted for some time, so it's not immediately clear why the resource would be missing.
@arjan-bal Reading through the logs, it seems to me that the listener resource config for iam-machine-api.auth:5000 is served down to the client and added to its cache:
2024/09/12 07:38:22 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] Resource type "ListenerResource" with name "iam-machine-api.auth:5000" added to cache
Subsequent requests for the rest of the APIs are also successful and initiate watches. Then, at some point later, the client sends another ADS request for the same listener:
2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1789812",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538184"
}
which in fact receives an empty response:
2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1789812",
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1538184"
}
and triggers the resource-not-found-error.
This, though, should be because the xDS server thinks that the client is up to date and that no update needs to be sent.
The only change I can see that could have affected the call flow is the ADS stream flow control mechanism: https://github.com/grpc/grpc-go/pull/7458. Is there anything new we need to implement on the xDS server side to respect the new flow?
Here are the events leading up to one of the resource-not-found errors. The client sends an ADS request acknowledging version 1790189:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1790189",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538494"
}
The server then responds with an empty response at a new version, 1790190. Since the version has changed, I believe this should be considered a not found response rather than not updated. Based on the envoy docs:
In the SotW protocol variants, all resource types except for Listener and Cluster are grouped into responses in the same way as in the incremental protocol variants. However, Listener and Cluster resource types are handled differently: the server must include the complete state of the world, meaning that all resources of the relevant type that are needed by the client must be included, even if they did not change since the last response.
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1790190",
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1538496"
}
The client then ACKs version 1790190:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1790190",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538496"
}
A response at version 1790190 does contain the listener details:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc01c3a2460] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1790190",
"resources": [
{
"@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
"name": "iam-policy-decision-api.auth:5000",
"apiListener": {
"apiListener": {
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
"routeConfig": {
"name": "iam-policy-decision-api.auth:5000",
"virtualHosts": [
{
"name": "iam-policy-decision-api.auth:5000",
"domains": [
"iam-policy-decision-api.auth:5000"
],
"routes": [
{
"match": {
"prefix": ""
},
"route": {
"cluster": "iam-policy-decision-api.auth.5000"
}
}
],
"retryPolicy": {
"retryOn": "cancelled,deadline-exceeded,internal,unavailable",
"numRetries": 3,
"retryBackOff": {
"baseInterval": "0.005s",
"maxInterval": "0.050s"
}
}
}
]
},
"httpFilters": [
{
"name": "envoy.filters.http.router",
"typedConfig": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
}
}
]
}
}
}
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1539401"
}
@arjan-bal One difference I've spotted compared to the previous client version (v1.65.0) is that, following the incremental xDS flow, it was sending ADS requests including a list of resources. For example:
2024/09/13 09:14:32 INFO: [xds] [xds-client 0xc0093c02d0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "8",
"resourceNames": [
"iam-machine-api.auth:5000",
"iam-policy-decision-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "16"
}
Our xDS server implementation used that list to keep an up-to-date snapshot for the respective node. This was implemented because the client's behaviour, when a connection to a resource hit the idle timeout, was to send an ADS request with an updated list (missing the resources that timed out) to notify the xDS server that it had offloaded the respective resources from its cache and that the server needed to remove them from the snapshot. Please correct me if I am wrong here!
The reason I was asking about the ADS flow control feature added in the latest release is that we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side.
The ADS flow control mechanism that was implemented recently only applies flow control on the receive path, i.e. once an ADS response is received by the xDS client in gRPC, it notifies all registered watchers for all the resources in that response, and blocks the next read on the ADS stream until all the watchers have processed the previous update.
One difference I've spotted comparing to the previous client version (v1.65.0) is that following the incremental xDS, it was sending ADS requests including a list of resources
gRPC only supports the SotW variant. It does not support the incremental variant. I'm not sure what exactly you mean here. The xDS server is expected to return all requested resources (for LDS and CDS), even if it knows that it had sent the same resource previously to this client.
The reason I was asking about the ADS flow control feature added in the latest release is because we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side
That doesn't sound right, and if that is the case, then it's a bug. Would you be able to give us a simple way to reproduce this problem? Thanks.
@easwars If I create a very simplistic example, where I call 2 echo servers like:
serverAddr1 := "xds:///grpc-echo-server.labs:50051"
serverAddr2 := "xds:///grpc-echo-another-server.labs:50051"
conn1, err := grpc.Dial(serverAddr1, grpc.WithInsecure())
if err != nil {
log.Fatalf("Could not connect %v", err)
}
defer conn1.Close()
c1 := echo.NewEchoServerClient(conn1)
ctx := context.Background()
r1, err := c1.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
if err != nil {
log.Printf("Could not get RPC %v\n", err)
}
log.Printf("RPC Response: %v", r1)
time.Sleep(1 * time.Second)
conn2, err := grpc.Dial(serverAddr2, grpc.WithInsecure())
if err != nil {
log.Fatalf("Could not connect %v", err)
}
defer conn2.Close()
c2 := echo.NewEchoServerClient(conn2)
r2, err := c2.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
if err != nil {
log.Printf("Could not get RPC %v\n", err)
}
log.Printf("RPC Response: %v", r2)
and build this with grpc-go v1.65.0 and latest (v1.66.2), I can see how the behaviour changes between releases.
Logs when running with v1.65.0: https://gist.github.com/ffilippopoulos/c2112dccb04e05a7255184bb305b4593
Logs when running with v1.66.2: https://gist.github.com/ffilippopoulos/d5576a2ac7efbc8e9f4e89aa9fe920a7
From there you can see that, when building the above code with grpc-go v1.65, I get the following ADS discovery requests:
2024/09/13 20:41:13 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-869c7c9dfb-jmrm6",
"locality": {},
"userAgentName": "gRPC Go",
"userAgentVersion": "1.65.0",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
2024/09/13 20:41:14 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "2",
"resourceNames": [
"grpc-echo-server.labs:50051",
"grpc-echo-another-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "3"
}
Whereas when using v1.66.2, it looks like my requests are sent from different xDS clients:
2024/09/13 20:47:37 INFO: [xds] [xds-client 0xc00023a9b0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-5b48dff6b5-slqq4",
"userAgentName": "gRPC Go",
"userAgentVersion": "1.66.2",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
2024/09/13 20:47:38 INFO: [xds] [xds-client 0xc000374eb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-5b48dff6b5-slqq4",
"userAgentName": "gRPC Go",
"userAgentVersion": "1.66.2",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-another-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
I appreciate my knowledge in grpc xDS might not be very deep so please let me know if I am doing anything wrong here.
Ah, now I see what is happening. So, this is not really related to the flow control changes that we made, but is related to some fallback changes that we are in the process of making.
So, earlier, we used to have a global singleton xDS client for the whole grpc application and if the application creates multiple grpc channels (thereby requesting multiple listener resources), all these listener resources would be requested by the same xDS client, and therefore would result in multiple listener resource names being specified in a single LDS request.
But with fallback, we have switched to using a separate xDS client per grpc target URI. So, if you create three grpc channels:
xds:///foo
xds:///foo
xds:///bar
channels 1 and 2 would share an xDS client, while channel 3 would get its own. On the server side, though, all servers share a single xDS client. Hope this helps.
This is the PR that contains the changes I described above: https://github.com/grpc/grpc-go/pull/7347
And this is the design for the overall fallback feature: https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md
@easwars do we need to do anything from our side or this is intended behavior?
@easwars do we need to do anything from our side or this is intended behavior?
This is working as per the design.
@ffilippopoulos feel free to reopen if you have any more questions
What version of gRPC are you using?
v1.66.0
What version of Go are you using (go version)?
1.22.5
What operating system (Linux, Windows, …) and version?
Alpine 3.14
K8s v1.30.1
What did you do?
Ran v1.66.0 across a fleet of gRPC clients & servers, using xDS to talk to an in-house control plane.
xDS services are configured with the round_robin load-balancer.
Logs captured with:
GRPC_GO_LOG_VERBOSITY_LEVEL=99
GRPC_GO_LOG_SEVERITY_LEVEL=info
What did you expect to see?
Running v1.65.0 we observe no such errors with our clients talking to our servers.
What did you see instead?
A % of requests will see...
We observe the errors happening multiple times per-minute, for a few minutes & then the errors will drop for up to 5 minutes. It's frequent enough to reliably replicate the problem across all our environments should you need us to capture more information.
I've not yet understood the pattern of when it fails & how often, but have attached logs.tar.gz from a small timeframe when we observed multiple errors, for a single container.
We were focused on the failing calls to iam-policy-decision-api.auth:5000 during our debugging &, by querying against our trace backend, observed the following timestamps where these errors occurred. Note logs are in UTC while traces below are in BST (UTC+1).