grpc / grpc

The C based gRPC (C++, Python, Ruby, Objective-C, PHP, C#)
https://grpc.io
Apache License 2.0
41.7k stars 10.51k forks source link

Keepalive watchdog timer fires against gRPCLB, causing intermittent UNAVAILABLE errors #34108

Open noahdav opened 1 year ago

noahdav commented 1 year ago

I am trying to reference issue https://github.com/grpc/grpc/issues/33344. Where we are seeing the watchdog timeout occur and this error is returned to the clients.

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "empty address list: grpclb in fallback mode without any balancer addresses: "
    debug_error_string = "{"created":"@1678163703.989066479","description":"Error received from peer <elided>","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"empty address list: grpclb in fallback mode without any balancer addresses: ","grpc_status":14}"

It looks like this is the actual failure.

I0526 14:25:37.198248096 89 grpclb.cc:1328] [grpclb 0x19b3550] lb_calld=0x1a6bc40: Status from LB server received. Status = 14, details = 'keepalive watchdog timeout', (lb_call: 0x1a58130), error 'OK'

@yashykt

Sorry for the delay in providing the logs. We have had trouble getting a repo with the trace you requested. Instead we have an older instance with all tracing so we filtered out much of this to make the logs reasonable. If we happened to filter out too much please let me know and I can regenerate logs with whatever filters you need.

I am attaching a segment of the log which you can see that the keepalive is sent but for some reason it is failing and eventually the error is bubbled up to the client.

On line 514 you can see the start of the keepalive ping.
"1686950341207","I0616 16:19:01.206883327 206 writing.cc:142] CLIENT: Ping sent [ipv4:10.11.162.170:4242]: 0/0"

On line 547 the keepalive fails and the transport starts closing. "1686950552016","I0616 16:22:32.016505957 313 chttp2_transport.cc:2940] ipv4:10.11.162.170:4242: Keepalive watchdog fired. Closing transport."

This same pattern occurs a number of times then on line 995 this error is returned to the client. "1686950586705","2023-06-16 16:23:06,705 - INFO - c5c3d000-865e-4d8c-8445-7b51ec98ad71: Got a grpc error with status = StatusCode.UNAVAILABLE, details = empty address list: grpclb in fallback mode without any balancer addresses: , debug_error_string = {""created"":""@1686950586.704988243"",""description"":""Error received from peer <endpoint>:4242"",""file"":""src/core/lib/surface/call.cc"",""file_line"":952,""grpc_message"":""empty address list: grpclb in fallback mode without any balancer addresses: "",""grpc_status"":14}"

2023-06-16t14_13_06.705-07_00-2023-06-16t14_33_06.705-07_00.-filtered.csv

Originally posted by @noahdav in https://github.com/grpc/grpc/issues/33344#issuecomment-1681189680

noahdav commented 1 year ago

@yashykt

Can we get any assistance here?