Closed: rcrowe closed this issue 1 month ago.
Before each of the error logs, the listener resource for iam-machine-api.auth isn't found (maybe the servers restarted?). This causes an empty resolver state to be created, which results in the creation of a pickfirst leaf LB policy, the default LB policy when none is specified. pickfirst is given the empty endpoint list and logs the error we're seeing.
2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] Received resource-not-found-error for Listener resource "iam-machine-api.auth:5000"
2024/09/12 07:38:25 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1789713",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538097"
}
2024/09/12 07:38:25 INFO: [xds] [xds-resolver 0xc00918f320] For Listener resource "iam-machine-api.auth:5000" and RouteConfiguration resource "", generated service config: {
"loadBalancingConfig": [
{
"xds_cluster_manager_experimental": {
"children": {
"cluster:iam-machine-api.auth.5000": {
"childPolicy": [
{
"cds_experimental": {
"cluster": "iam-machine-api.auth.5000"
}
}
]
}
}
}
}
]
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
"Addresses": null,
"Endpoints": [],
"ServiceConfig": {
"Config": {
"Config": null,
"Methods": {}
},
"Err": null
},
"Attributes": {
"\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x0\u003e",
"\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0xc007d369b0\u003e"
}
} ()
2024/09/12 07:38:25 INFO: [xds] [xds-cluster-manager-lb 0xc02deeda00] Update with config {
"LoadBalancingConfig": null,
"Children": {
"cluster:iam-machine-api.auth.5000": {
"ChildPolicy": [
{
"cds_experimental": {
"LoadBalancingConfig": null,
"Cluster": "iam-machine-api.auth.5000"
}
}
]
}
}
}, resolver state {Addresses:[] Endpoints:[] ServiceConfig:0xc02fff8580 Attributes:{"<%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)>": "<0x0>" , "<%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)>": "<0xc007d369b0>" }}
2024/09/12 07:38:25 INFO: [xds] [cds-lb 0xc02f3cc9c0] Received balancer config update: {
"LoadBalancingConfig": null,
"Cluster": "iam-machine-api.auth.5000"
}
2024/09/12 07:38:25 INFO: [core] [Channel #9]Resolver state updated: {
"Addresses": null,
"Endpoints": [],
"ServiceConfig": {
"Config": {
"Config": null,
"Methods": {}
},
"Err": null
},
"Attributes": null
} (service config updated)
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel switches to new LB policy "pick_first"
2024/09/12 07:38:25 INFO: [pick-first-lb] [pick-first-lb 0xc02fffbda0] Received error from the name resolver: produced zero addresses
2024/09/12 07:38:25 INFO: [core] [Channel #9]Channel Connectivity change to TRANSIENT_FAILURE
2024/09/12 07:38:25 INFO: [core] error from balancer.UpdateClientConnState: bad resolver state
Once the listener becomes available, pickfirst is removed. I'm still looking into whether anything changed in 1.66 about the handling of not-found listeners that would cause these logs to show up.
Thanks @arjan-bal
The logs were captured from a service that would have been hitting a few services configured via xDS; however, during our investigation we focused only on iam-policy-decision-api.auth, so we can only say with certainty that this endpoint was exhibiting the issue. It could well have been the same with the machine endpoint.
I'll take that away & try to dig into why the listener resource is missing, as I can see from the logs that this is happening across both endpoints discussed here. As best I can tell, none of the servers have been restarted for some time, so it's not immediately clear why the resource would be missing.
@arjan-bal Reading through the logs, it seems to me that the listener resource config for iam-machine-api.auth:5000 is served down to the client and added to its cache:
2024/09/12 07:38:22 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] Resource type "ListenerResource" with name "iam-machine-api.auth:5000" added to cache
Subsequent requests for the rest of the APIs are also successful and initiate watches. Then, at some point later, the client sends another ADS request for the same listener:
2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1789812",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538184"
}
which in fact receives an empty response:
2024/09/12 07:38:26 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1789812",
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1538184"
}
and triggers the resource-not-found-error.
This, though, should be because the xDS server thinks that the client is up to date and that no update needs to be sent.
The only change I can see that could have affected the call flow is the ADS stream flow control mechanism: https://github.com/grpc/grpc-go/pull/7458. Is there anything new we need to implement on the xDS server side to respect the new flow?
Here are the events leading up to one of the resource-not-found errors. The client sends an ADS request acknowledging version 1790189:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1790189",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538494"
}
The server then responds with an empty response at a new version, 1790190. Since the version has changed, I believe this should be considered a not found response rather than not updated. Based on the envoy docs:
In the SotW protocol variants, all resource types except for Listener and Cluster are grouped into responses in the same way as in the incremental protocol variants. However, Listener and Cluster resource types are handled differently: the server must include the complete state of the world, meaning that all resources of the relevant type that are needed by the client must be included, even if they did not change since the last response.
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1790190",
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1538496"
}
The client then ACKs version 1790190:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc0091edd10] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "1790190",
"resourceNames": [
"iam-machine-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "1538496"
}
A response at version 1790190 does contain the listener details:
2024/09/12 07:38:33 INFO: [xds] [xds-client 0xc01c3a2460] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS response received: {
"versionInfo": "1790190",
"resources": [
{
"@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
"name": "iam-policy-decision-api.auth:5000",
"apiListener": {
"apiListener": {
"@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
"routeConfig": {
"name": "iam-policy-decision-api.auth:5000",
"virtualHosts": [
{
"name": "iam-policy-decision-api.auth:5000",
"domains": [
"iam-policy-decision-api.auth:5000"
],
"routes": [
{
"match": {
"prefix": ""
},
"route": {
"cluster": "iam-policy-decision-api.auth.5000"
}
}
],
"retryPolicy": {
"retryOn": "cancelled,deadline-exceeded,internal,unavailable",
"numRetries": 3,
"retryBackOff": {
"baseInterval": "0.005s",
"maxInterval": "0.050s"
}
}
}
]
},
"httpFilters": [
{
"name": "envoy.filters.http.router",
"typedConfig": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
}
}
]
}
}
}
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"nonce": "1539401"
}
@arjan-bal One difference I've spotted compared to the previous client version (v1.65.0) is that, following the incremental xDS flow, it was sending ADS requests including a list of resources. For example:
2024/09/13 09:14:32 INFO: [xds] [xds-client 0xc0093c02d0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "8",
"resourceNames": [
"iam-machine-api.auth:5000",
"iam-policy-decision-api.auth:5000"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "16"
}
Our xDS server implementation used that list to keep an up-to-date snapshot for the respective node. This was implemented because the client's behaviour, when a connection to a resource hit the idle timeout, was to send an ADS request with an updated list (missing the resources that timed out) to notify the xDS server that it had offloaded the respective resources from its cache and that the server needed to remove them from the snapshot. Please correct me if I am wrong here!
The reason I was asking about the ADS flow control feature added in the latest release is that we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side.
The ADS flow control mechanism that was implemented recently only applies flow control on the receive path, i.e. once an ADS response is received by the xDS client in gRPC, it notifies all registered watchers for all the resources in that response, and blocks the next read on the ADS stream until all the watchers have processed the previous update.
One difference I've spotted comparing to the previous client version (v1.65.0) is that following the incremental xDS, it was sending ADS requests including a list of resources
gRPC only supports the SotW variant. It does not support the incremental variant. I'm not sure what exactly you mean here. The xDS server is expected to return all requested resources (for LDS and CDS), even if it knows that it had sent the same resource previously to this client.
The reason I was asking about the ADS flow control feature added in the latest release is because we now see all our ADS requests containing a single resourceName, which would result in constantly updating our snapshots on the xDS server side
That doesn't sound right, and if that is the case, then it's a bug. Would you be able to give us a simple way to reproduce this problem? Thanks.
@easwars If I create a very simplistic example, where I call 2 echo servers like:
serverAddr1 := "xds:///grpc-echo-server.labs:50051"
serverAddr2 := "xds:///grpc-echo-another-server.labs:50051"
conn1, err := grpc.Dial(serverAddr1, grpc.WithInsecure())
if err != nil {
log.Fatalf("Could not connect %v", err)
}
defer conn1.Close()
c1 := echo.NewEchoServerClient(conn1)
ctx := context.Background()
r1, err := c1.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
if err != nil {
log.Printf("Could not get RPC %v\n", err)
}
log.Printf("RPC Response: %v", r1)
time.Sleep(1 * time.Second)
conn2, err := grpc.Dial(serverAddr2, grpc.WithInsecure())
if err != nil {
log.Fatalf("Could not connect %v", err)
}
defer conn2.Close()
c2 := echo.NewEchoServerClient(conn2)
r2, err := c2.SayHello(ctx, &echo.EchoRequest{Name: "unary RPC msg "})
if err != nil {
log.Printf("Could not get RPC %v\n", err)
}
log.Printf("RPC Response: %v", r2)
and build this with grpc-go v1.65.0 and latest (v1.66.2), I can see how the behaviour changes between releases.
Logs when running with v1.65.0: https://gist.github.com/ffilippopoulos/c2112dccb04e05a7255184bb305b4593
Logs when running with v1.66.2: https://gist.github.com/ffilippopoulos/d5576a2ac7efbc8e9f4e89aa9fe920a7
From there you can see that, when building the above code with grpc-go v1.65, I get the following ADS discovery requests:
2024/09/13 20:41:13 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-869c7c9dfb-jmrm6",
"locality": {},
"userAgentName": "gRPC Go",
"userAgentVersion": "1.65.0",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
2024/09/13 20:41:14 INFO: [xds] [xds-client 0xc0002ceeb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"versionInfo": "2",
"resourceNames": [
"grpc-echo-server.labs:50051",
"grpc-echo-another-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener",
"responseNonce": "3"
}
Whereas when using v1.66.2, it looks like my requests are sent from different xDS clients:
2024/09/13 20:47:37 INFO: [xds] [xds-client 0xc00023a9b0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-5b48dff6b5-slqq4",
"userAgentName": "gRPC Go",
"userAgentVersion": "1.66.2",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
2024/09/13 20:47:38 INFO: [xds] [xds-client 0xc000374eb0] [semaphore-xds.sys-semaphore.svc.cluster.local:18000] ADS request sent: {
"node": {
"id": "labs/grpc-echo-client-5b48dff6b5-slqq4",
"userAgentName": "gRPC Go",
"userAgentVersion": "1.66.2",
"clientFeatures": [
"envoy.lb.does_not_support_overprovisioning",
"xds.config.resource-in-sotw"
]
},
"resourceNames": [
"grpc-echo-another-server.labs:50051"
],
"typeUrl": "type.googleapis.com/envoy.config.listener.v3.Listener"
}
I appreciate my knowledge in grpc xDS might not be very deep so please let me know if I am doing anything wrong here.
Ah, now I see what is happening. So, this is not really related to the flow control changes that we made, but is related to some fallback changes that we are in the process of making.
So, earlier, we used to have a global singleton xDS client for the whole grpc application and if the application creates multiple grpc channels (thereby requesting multiple listener resources), all these listener resources would be requested by the same xDS client, and therefore would result in multiple listener resource names being specified in a single LDS request.
But with fallback, we have switched to using a separate xDS client per grpc target URI. So, if you create three grpc channels:
xds:///foo
xds:///foo
xds:///bar
channels 1 and 2 would share an xDS client, while channel 3 would get its own. On the server side, though, all servers share a single xDS client. Hope this helps.
This is the PR that contains the changes I described above: https://github.com/grpc/grpc-go/pull/7347
And this is the design for the overall fallback feature: https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md
@easwars do we need to do anything from our side or this is intended behavior?
@easwars do we need to do anything from our side or this is intended behavior?
This is working as per the design.
@ffilippopoulos feel free to reopen if you have any more questions
What version of gRPC are you using?
v1.66.0
What version of Go are you using (go version)?
1.22.5
What operating system (Linux, Windows, …) and version?
Alpine 3.14
K8s v1.30.1
What did you do?
Ran v1.66.0 across a fleet of gRPC clients & servers, using xDS to talk to an in-house control plane.
xDS services are configured with the round_robin load-balancer.
Logs captured with:
GRPC_GO_LOG_VERBOSITY_LEVEL=99
GRPC_GO_LOG_SEVERITY_LEVEL=info
What did you expect to see?
Running v1.65.0 we observe no such errors with our clients talking to our servers.
What did you see instead?
A % of requests will see...
We observe the errors happening multiple times per-minute, for a few minutes & then the errors will drop for up to 5 minutes. It's frequent enough to reliably replicate the problem across all our environments should you need us to capture more information.
I've not yet understood the pattern of when it fails & how often, but have attached logs.tar.gz from a small timeframe when we observed multiple errors, for a single container.
We were focused on the failing calls to iam-policy-decision-api.auth:5000 during our debugging &, by querying against our trace backend, observed the following timestamps where these errors occurred. Note logs are in UTC while traces below are in BST (UTC+1).