grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Intermittently the client enters a state where it doesn't receive a response sent by the server #2502

Open P0rth0s opened 1 year ago

P0rth0s commented 1 year ago

Problem description

Intermittently our gRPC client enters a state where the server sends a response, but the client never receives it and throws a DEADLINE_EXCEEDED error. The error persists on retries until the server or client is restarted.

Reproduction steps

Unknown - appears to eventually enter this state in longer-lived environments.

Environment

Additional context

Client Logs


D 2023-07-12T20:00:37.761Z | resolving_call | [4] Created

D 2023-07-12T20:00:37.761Z | channel | (43) dns:<redacted ip> createResolvingCall [4] method="<redacted method>", deadline=2023-07-12T20:01:22.760Z

D 2023-07-12T20:00:37.762Z | resolving_call | [4] start called

D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline will be reached in 44998ms

D 2023-07-12T20:00:37.762Z | resolving_call | [4] Deadline: 2023-07-12T20:01:22.760Z

D 2023-07-12T20:00:37.763Z | resolving_call | [4] startRead called

D 2023-07-12T20:00:37.764Z | resolving_call | [4] halfClose called

D 2023-07-12T20:00:37.764Z | resolving_call | [4] write() called with message of length 38

D 2023-07-12T20:00:37.764Z | resolving_call | [4] Created child [5]

D 2023-07-12T20:00:37.764Z | channel | (43) dns:<redacted ip> createRetryingCall [5] method="<redacted method>"

D 2023-07-12T20:00:37.765Z | load_balancing_call | [6] start called

D 2023-07-12T20:00:37.765Z | retrying_call | [5] Created child call [6] for attempt 1

D 2023-07-12T20:00:37.765Z | channel | (43) dns:<redacted ip> createLoadBalancingCall [6] method="<redacted method>"

D 2023-07-12T20:00:37.765Z | retrying_call | [5] start called

D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick called

D 2023-07-12T20:00:37.766Z | load_balancing_call | [6] Pick result: COMPLETE subchannel: (44) <redacted ip> status: undefined undefined

D 2023-07-12T20:00:37.766Z | retrying_call | [5] startRead called

D 2023-07-12T20:00:37.770Z | load_balancing_call | [6] Created child call [7]

D 2023-07-12T20:00:37.770Z | transport_internals | (45) <redacted ip> session.closed=false session.destroyed=false session.socket.destroyed=false

D 2023-07-12T20:00:37.770Z | transport_flowctrl | (45) <redacted ip> local window size: 65535 remote window size: 65535

D 2023-07-12T20:00:37.771Z | retrying_call | [5] write() called with message of length 43

D 2023-07-12T20:00:37.771Z | subchannel_call | [7] sending data chunk of length 43

D 2023-07-12T20:00:37.771Z | subchannel_call | [7] write() called with message of length 43

D 2023-07-12T20:00:37.771Z | load_balancing_call | [6] write() called with message of length 43

D 2023-07-12T20:00:37.772Z | retrying_call | [5] halfClose called

D 2023-07-12T20:00:37.773Z | subchannel_call | [7] calling end() on HTTP/2 stream

D 2023-07-12T20:00:37.773Z | subchannel_call | [7] end() called

D 2023-07-12T20:00:37.773Z | load_balancing_call | [6] halfClose called

D 2023-07-12T20:01:22.760Z | resolving_call | [4] cancelWithStatus code: 4 details: "Deadline exceeded"

D 2023-07-12T20:01:22.760Z | retrying_call | [5] cancelWithStatus code: 4 details: "Deadline exceeded"

D 2023-07-12T20:01:22.761Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"

D 2023-07-12T20:01:22.761Z | load_balancing_call | [6] cancelWithStatus code: 4 details: "Deadline exceeded"

D 2023-07-12T20:01:22.761Z | subchannel_call | [7] cancelWithStatus code: 4 details: "Deadline exceeded"

D 2023-07-12T20:01:22.761Z | subchannel_call | [7] ended with status: code=4 details="Deadline exceeded"

D 2023-07-12T20:01:22.762Z | retrying_call | [5] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [6] in state ACTIVE

D 2023-07-12T20:01:22.762Z | retrying_call | [5] Received status from child [6]

D 2023-07-12T20:01:22.762Z | load_balancing_call | [6] ended with status: code=4 details="Deadline exceeded"

D 2023-07-12T20:01:22.762Z | subchannel_call | [7] close http2 stream with code 8

D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status

D 2023-07-12T20:01:22.763Z | load_balancing_call | [6] Received status

D 2023-07-12T20:01:22.763Z | resolving_call | [4] Received status

D 2023-07-12T20:01:22.763Z | resolving_call | [4] ended with status: code=4 details="Deadline exceeded"

D 2023-07-12T20:01:22.763Z | retrying_call | [5] ended with status: code=4 details="Deadline exceeded"

D 2023-07-12T20:01:22.864Z | subchannel_call | [7] HTTP/2 stream closed with code 8

Server Logs


D 2023-07-12T20:00:37.774Z | server | (1) Received call to method <redacted method> at address null

D 2023-07-12T20:00:37.774Z | server_call | Request to <redacted method> received headers {"trackingid":["<trackingId>"],"grpc-accept-encoding":["identity,deflate,gzip"],"accept-encoding":["identity"],"grpc-timeout":["44993m"],"user-agent":["grpc-node-js/1.8.14"],"content-type":["application/grpc"],"te":["trailers"]}

D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> stream closed with rstCode 0

D 2023-07-12T20:00:37.777Z | server_call | Request to method <redacted method> ended with status code: OK details: OK

As you can see, the server responds well within the deadline, but the client never gets the response. I know transient failures can happen, but since the error persists on retries, it appears something deeper is going on here.
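For context, the `grpc-timeout: 44993m` header in the server logs is how the client's remaining deadline travels on the wire: per the gRPC-over-HTTP/2 spec it is an integer of at most 8 digits followed by a unit letter (`m` = milliseconds, `S` = seconds, and so on). A minimal sketch of that encoding (illustrative only, not grpc-js's actual implementation):

```javascript
// Encode a remaining-time-in-milliseconds value as a grpc-timeout header
// value. Units per the gRPC HTTP/2 spec: m = milliseconds, S = seconds,
// M = minutes, H = hours. The numeric part is limited to 8 digits, so we
// fall back to coarser units for very distant deadlines.
function msToGrpcTimeout(ms) {
  const units = [
    ['m', 1],        // milliseconds
    ['S', 1000],     // seconds
    ['M', 60000],    // minutes
    ['H', 3600000],  // hours
  ];
  for (const [unit, factor] of units) {
    const value = Math.ceil(ms / factor);
    if (value < 1e8) {
      return `${value}${unit}`;
    }
  }
  throw new Error('deadline too far in the future to encode');
}

console.log(msToGrpcTimeout(44993)); // "44993m", matching the server log above
```

The client log shows the deadline 44998ms away when the call is created, and the header carries 44993m, the time remaining when the request was actually sent.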

murgatroid99 commented 1 year ago

Did you have all tracers enabled here, or only some of them? You said that this persists on retries; does the server log that it receives multiple requests when those retries occur? Do you have keepalives enabled on the client? Can you trace this interaction with Wireshark or tcpdump and share the dump log?

P0rth0s commented 1 year ago

All tracers should be enabled, I have GRPC_TRACE="all" and GRPC_VERBOSITY="DEBUG".
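For reference, that tracer configuration is driven entirely by environment variables set before the process starts; a sketch (the entry-point name is a placeholder):

```shell
# Enable all grpc-js tracers at DEBUG verbosity for this process.
export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG
# node your-app.js   # placeholder entry point
```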

Yes, retrying results in another set of approximately the same logs on both the client and server

I am not passing keepAlive. I will add it.

Will look into whether it is possible for me to get a Wireshark trace.

murgatroid99 commented 1 year ago

Yes, retrying results in another set of approximately the same logs on both the client and server

OK, that probably means that the connection is still valid, which means that keepalives probably won't change anything.

P0rth0s commented 1 year ago

The environment was accidentally taken out of this bad state before I could capture a tcpdump. Will get a tcpdump ASAP when I repro again; it might be a couple of days.

wwilfinger commented 1 year ago

I may be seeing this same issue, but it might be different. Let me know, and I can open a separate issue if needed.

All these reproductions were using @grpc/grpc-js@1.8.17, @google-cloud/pubsub@3.2.1, and google-gax@3.6.0. We've seen the issue in the wild for at least the past six months or so, going off my own memory.

We run a workload in GKE that serves REST requests and also publishes to PubSub. Very occasionally, about once a month, a single pod will go off the rails and continually log THIS DEADLINE_EXCEEDED error from gax-nodejs. PubSub messages are never published, the pod never recovers, and we need to kill the pod.

I wrote a small script that, in a loop, awaits a publish to PubSub and waits 500ms. I added error logging to gax-nodejs around HERE. I was able to reproduce. Every error was this stack trace from grpc-node v1.8.17:

Stack trace

```
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
    at callErrorFromStatus (/opt/code/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/opt/code/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
    at Object.onReceiveStatus (/opt/code/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/opt/code/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /opt/code/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
for call at
{
  code: 4,
  details: 'Deadline exceeded',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
```

I found this example of adding channelz support (thank you!). Added that, reproduced again.

channelz

Shortly after startup grpcdebug works fine, but when I come back hours later, even healthy pods still publishing to PubSub show this "failed to fetch subchannel" error. 🤷

```
% grpcdebug localhost:5555 channelz channel 3
(...)
failed to fetch subchannel (id=5): rpc error: code = NotFound desc = No subchannel data found for id 5
```

But `--json` gives more info. The following is from the pod experiencing the issue.
json output `# grpcdebug localhost:5555 channelz channel 3 --json` ``` { "ref": { "channel_id": 3, "name": "pubsub.googleapis.com:443" }, "data": { "state": { "state": 3 }, "target": "pubsub.googleapis.com:443", "trace": { "num_events_logged": 2302, "creation_timestamp": { "seconds": 1689903203, "nanos": 711000000 }, "events": [ { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965014, "nanos": 105000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965014, "nanos": 105000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965014, "nanos": 105000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965014, "nanos": 105000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965017, "nanos": 305000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965017, "nanos": 305000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965017, "nanos": 305000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965017, "nanos": 305000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 206000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 206000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 206000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, 
"timestamp": { "seconds": 1689965018, "nanos": 305000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 306000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 306000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 306000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 306000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 306000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 405000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 405000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 406000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 406000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 406000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 406000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 406000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 506000000 }, "ChildRef": null }, { "description": "CONNECTING 
-\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 506000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 506000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965018, "nanos": 705000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 606000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 705000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 705000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 705000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 706000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 706000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 706000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 706000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 707000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1689965021, "nanos": 806000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e READY", "severity": 1, "timestamp": { "seconds": 1689965031, "nanos": 506000000 }, "ChildRef": null }, { 
"description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1371, "name": "2607:f8b0:4001:c0b::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1372, "name": "108.177.111.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1373, "name": "2607:f8b0:4001:c0f::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1374, "name": "142.250.1.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1375, "name": "2607:f8b0:4001:c10::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1376, "name": "108.177.121.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1377, "name": "2607:f8b0:4001:c11::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1378, "name": "142.250.103.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { 
"seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1379, "name": "108.177.120.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1380, "name": "142.251.171.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1381, "name": "142.250.159.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1320, "name": "142.251.120.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1382, "name": "142.251.161.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1383, "name": "74.125.126.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1384, "name": "74.125.132.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1385, "name": "74.125.201.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1386, "name": "74.125.69.95:443" } 
} }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1387, "name": "64.233.182.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1388, "name": "64.233.183.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 805000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 1389, "name": "173.194.193.95:443" } } }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 806000000 }, "ChildRef": null }, { "description": "Address resolution succeeded", "severity": 1, "timestamp": { "seconds": 1689965146, "nanos": 806000000 }, "ChildRef": null }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1689965290, "nanos": 53000000 }, "ChildRef": null } ] }, "calls_started": 117739, "calls_succeeded": 116532, "calls_failed": 1207, "last_call_started_timestamp": { "seconds": 1689971008, "nanos": 560000000 } }, "subchannel_ref": [ { "subchannel_id": 5, "name": "74.125.69.95:443" }, { "subchannel_id": 179, "name": "108.177.120.95:443" }, { "subchannel_id": 227, "name": "74.125.201.95:443" }, { "subchannel_id": 332, "name": "64.233.182.95:443" }, { "subchannel_id": 423, "name": "142.251.171.95:443" }, { "subchannel_id": 504, "name": "209.85.145.95:443" }, { "subchannel_id": 551, "name": "142.250.148.95:443" }, { "subchannel_id": 644, "name": "173.194.197.95:443" }, { "subchannel_id": 749, "name": "142.250.1.95:443" }, { "subchannel_id": 825, "name": "172.217.212.95:443" }, { "subchannel_id": 898, "name": "209.85.146.95:443" }, { "subchannel_id": 977, "name": "74.125.132.95:443" }, { "subchannel_id": 1046, "name": 
"64.233.182.95:443" }, { "subchannel_id": 1146, "name": "172.217.212.95:443" }, { "subchannel_id": 1228, "name": "209.85.146.95:443" }, { "subchannel_id": 1320, "name": "142.251.120.95:443" } ] } ```
I can't get grpcdebug to show any info on the subchannels listed. Converting the trace event timestamps to the same timezone as this metric screenshot...
Trace events `cat channel.json | jq -r '.data.trace.events[] | [.timestamp.seconds, .description] | @tsv' | while IFS=$'\t' read -r timestamp description; do echo $(date -d @${timestamp} --iso-8601=seconds) ${description}; done` ``` 2023-07-21T13:43:34-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:34-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:34-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:34-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:37-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:37-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:37-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:37-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:38-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 
CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:41-05:00 CONNECTING -> CONNECTING 2023-07-21T13:43:51-05:00 CONNECTING -> READY 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 Created subchannel or used existing subchannel 2023-07-21T13:45:46-05:00 READY -> READY 2023-07-21T13:45:46-05:00 Address resolution succeeded 2023-07-21T13:48:10-05:00 READY -> READY ```
![reconnecting](https://github.com/grpc/grpc-node/assets/11001826/7b1b9c5c-b3be-4a99-b4ef-b69b8644758f)

Everything goes wrong around that 2023-07-21T13:43:34 timestamp. The DEADLINE_EXCEEDED errors start being logged at 2023-07-21T13:43:38.906, and I am using a 5 second `initialRpcTimeoutMillis`, so that matches up. CPU drops off a cliff a couple of minutes later. Memory starts climbing (these pods eventually OOM if I let them). I tried capturing a tcpdump, but there's nothing hitting the wire, so I have nothing to show there.

I'm not sure what would trigger what I think is the resolving_load_balancer to re-resolve, but it looks like that's what happened in this case.

I'm running this on a zonal GKE cluster in us-central1-a and it will reproduce on at least one of ten pods within 12 hours. I've not been able to reproduce on any of: Local PC running on my consumer Internet line, GKE Autopilot in us-central1, regional GKE cluster in us-west1.

Please let me know what I can collect that would be most useful for you to debug. I don't have tracing turned on because I didn't want to give myself a big GCP logging bill, but I might be able to write the tracing to a file and get it that way.

murgatroid99 commented 1 year ago

The DEADLINE_EXCEEDED error you linked from google-gax has the error text `Total timeout of API ${apiName} exceeded ${retry.backoffSettings.totalTimeoutMillis} milliseconds before any response was received.`, but the stack trace you shared has the error text `Deadline exceeded`. It should only be one or the other, so can you clarify what you are seeing there?

The channelz error you are seeing may indicate a channelz bug in the client. That component is not well-tested, so I wouldn't be surprised. Can you double check that you cannot get info on any of the listed subchannels?

I'll look into this more on Monday.

wwilfinger commented 1 year ago

can you clarify what you are seeing there?

I'm seeing both. Sorry about being unclear.

With no changes to gax-nodejs, I would only ever see the "Total timeout of API ${apiName} exceeded" error from HERE in gax.

Gax does not log, or provide the caller a way to log, an error from grpc-js that gax is going to retry. That would be done in gax around this line HERE. There are seven retry codes configured for google.pubsub.v1.Publisher Publish, set HERE. That means I didn't know which of the seven error codes was actually happening, and had no real idea of what was going on.

I added logging into gax (at that line) to log the error message gax is receiving from grpc-js and to also log some of the variables in scope at that point.

Here's a screenshot of the logging with my patch I had saved from earlier.

screenshot ![log](https://github.com/grpc/grpc-node/assets/11001826/51b5472c-65d4-47ed-a5f3-d094c56107b4)

These are the retry settings I'm using.

retry settings

```js
// These are defaults but with a shorter initialRpcTimeoutMillis
// https://github.com/googleapis/nodejs-pubsub/blob/v3.2.1/src/v1/publisher_client_config.json#L28-L38
backoffSettings: {
  initialRetryDelayMillis: 100,
  retryDelayMultiplier: 1.3,
  maxRetryDelayMillis: 60000,
  initialRpcTimeoutMillis: 5000, // default 60000
  rpcTimeoutMultiplier: 1.0,
  maxRpcTimeoutMillis: 60000,
  totalTimeoutMillis: 60000,
},
```

initialRpcTimeoutMillis=5000 and totalTimeoutMillis=60000 mean gax is retrying roughly every 5000ms. The log lines with the green details are the DEADLINE_EXCEEDED errors from grpc-js HERE. After 12 tries at 5000ms each, over 60,000ms has passed, which exceeds totalTimeoutMillis, so gax throws this error HERE.
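That arithmetic can be sanity-checked with a small standalone sketch. This is not gax's actual code; it assumes every attempt burns its full rpcTimeout (as happens when each attempt ends in DEADLINE_EXCEEDED) and uses one plausible accounting of when the total timeout is checked:

```javascript
// Rough simulation of a gax-style retry schedule under the settings above,
// assuming every RPC attempt times out after its full rpcTimeout.
function countAttempts(settings) {
  let delay = settings.initialRetryDelayMillis;
  let rpcTimeout = settings.initialRpcTimeoutMillis;
  let elapsed = 0;
  let attempts = 0;
  while (elapsed < settings.totalTimeoutMillis) {
    elapsed += rpcTimeout; // the attempt runs until DEADLINE_EXCEEDED
    attempts++;
    if (elapsed >= settings.totalTimeoutMillis) break;
    elapsed += delay; // back off before the next attempt
    delay = Math.min(delay * settings.retryDelayMultiplier,
                     settings.maxRetryDelayMillis);
    rpcTimeout = Math.min(rpcTimeout * settings.rpcTimeoutMultiplier,
                          settings.maxRpcTimeoutMillis);
  }
  return attempts;
}

const attempts = countAttempts({
  initialRetryDelayMillis: 100,
  retryDelayMultiplier: 1.3,
  maxRetryDelayMillis: 60000,
  initialRpcTimeoutMillis: 5000,
  rpcTimeoutMultiplier: 1.0,
  maxRpcTimeoutMillis: 60000,
  totalTimeoutMillis: 60000,
});
console.log(attempts); // 11 with this accounting, in line with the ~12 tries observed
```

With rpcTimeoutMultiplier at 1.0 every attempt costs a flat 5000ms, so the growing backoff delays are what push the total past 60,000ms around the eleventh or twelfth try.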

Once a pod gets into this state it never recovers. This is very similar to P0rth0s reporting "The error persists on retries until the server or client is restarted", which is why I added to this issue instead of creating a new one.

Thanks for the eyes on it!

murgatroid99 commented 1 year ago

Please try updating your dependencies so that you pick up @grpc/grpc-js version 1.8.19, and then enable keepalives. As far as I understand, you can do this with pubsub by constructing the instance like this:

```js
const pubsubClient = new PubSub({
  'grpc.keepalive_timeout_ms': 10000,
  'grpc.keepalive_time_ms': 30000,
} as any);
```

The `as any` is only needed if you are using TypeScript. If you are already passing other options to that constructor, these options can simply be added to the existing options object. The specific numbers there are suggested values; you can change them if necessary.

If that doesn't help, we can look into investigating further with trace logs.

P0rth0s commented 1 year ago

@murgatroid99 I sent you an email containing packet captures.

Note that I was originally getting DEADLINE_EXCEEDED errors but then started getting CANCELLED errors instead. However, all other behavior appears consistent.

oldmantaiter commented 1 year ago

FWIW, I was encountering this using the @opentelemetry/exporter-trace-otlp-grpc package, where all of a sudden every emission of trace data was met with the DEADLINE_EXCEEDED error. There was also a memory leak once a service instance got into this state.

This was on version 1.8.18, but it was also happening as far back as 1.4.1, although less frequently with that version. To stabilize our services we moved to the HTTP exporter instead of the gRPC one, but I'm bringing it up in case it adds investigatory context when looking at how @opentelemetry/exporter-trace-otlp-grpc has implemented connection handling and/or the options it passes to @grpc/grpc-js.

murgatroid99 commented 1 year ago

@oldmantaiter From your description of the error, I think you would benefit from enabling keepalives, but unfortunately it looks like there isn't a way to inject options into the relevant gRPC client constructor call here. You may want to open an issue with that repository to get them to either set those options or allow you to set them.

oldmantaiter commented 1 year ago

@murgatroid99 - Yep, that's where I ended up, and I switched libraries to resolve our issue. I'll open an issue asking them to allow setting gRPC options so others running into this might be able to enable keepalives in the future.

wwilfinger commented 1 year ago

After a few goof-ups, I was able to reproduce and grab trace logs. This was with:

I have a tarball of the trace logs available here: https://github.com/wwilfinger/grpc-node-deadline-exceeded/tree/main/trace-logs

![image](https://github.com/grpc/grpc-node/assets/11001826/039a16dd-1cb3-4e16-8281-0e384d1f5466)

(Screenshot is in UTC-5.) The first error message is at 2023-08-01T02:42:31.319Z. I'm exec'ing into the pod and running some commands at around 2023-08-01T14:45Z.
channelz ``` root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# grpcdebug localhost:5555 channelz channels Channel ID Target State Calls(Started/Succeeded/Failed) Created Time 3 pubsub.googleapis.com:443 READY 80617/72393/8223 22 hours ago root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# grpcdebug localhost:5555 channelz channel 3 Channel ID: 3 Target: pubsub.googleapis.com:443 State: READY Calls Started: 80623 Calls Succeeded: 72393 Calls Failed: 8229 Created Time: 22 hours ago --- 2023/08/01 14:22:20 failed to fetch subchannel (id=46): rpc error: code = NotFound desc = No subchannel data found for id 46 root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# grpcdebug localhost:5555 channelz channel 3 --json { "ref": { "channel_id": 3, "name": "pubsub.googleapis.com:443" }, "data": { "state": { "state": 3 }, "target": "pubsub.googleapis.com:443", "trace": { "num_events_logged": 1650, "creation_timestamp": { "seconds": 1690817609, "nanos": 22000000 }, "events": [ { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 321000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 821, "name": "142.250.125.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 321000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 739, "name": "142.250.136.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 418000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 822, "name": "142.250.148.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 418000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 823, "name": "209.85.200.95:443" } } }, { "description": "Created subchannel or used existing subchannel", 
"severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 418000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 824, "name": "142.251.172.95:443" } } }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 418000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 
1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854030, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 19000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 19000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 19000000 }, "ChildRef": null }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 119000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 812, "name": 
"2607:f8b0:4001:c0e::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 119000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 820, "name": "209.85.147.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 119000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 825, "name": "2607:f8b0:4001:c1e::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 119000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 821, "name": "142.250.125.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 826, "name": "2607:f8b0:4001:c1f::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 739, "name": "142.250.136.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 827, "name": "2607:f8b0:4001:c20::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 822, "name": "142.250.148.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 823, "name": "209.85.200.95:443" } } }, { "description": "Created subchannel or used existing subchannel", 
"severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 828, "name": "209.85.234.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 824, "name": "142.251.172.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 829, "name": "108.177.112.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 830, "name": "74.125.124.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 831, "name": "172.217.212.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 318000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 832, "name": "142.251.6.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 833, "name": "172.217.214.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 834, "name": "172.253.114.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 835, "name": 
"172.253.119.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 836, "name": "108.177.111.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 837, "name": "142.250.1.95:443" } } }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "Address resolution succeeded", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", 
"severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854031, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854032, "nanos": 218000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854032, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854032, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854032, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854032, "nanos": 318000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 219000000 }, "ChildRef": null }, { "description": 
"CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 320000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 418000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 518000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 518000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 519000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 519000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 519000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 618000000 }, 
"ChildRef": null }, { "description": "CONNECTING -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690854034, "nanos": 919000000 }, "ChildRef": null }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 864, "name": "2607:f8b0:4001:c24::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 865, "name": "74.125.70.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 866, "name": "2607:f8b0:4001:c22::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 867, "name": "74.125.124.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 868, "name": "2607:f8b0:4001:c23::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 869, "name": "172.217.212.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 870, "name": "2607:f8b0:4001:c07::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { 
"subchannel_id": 871, "name": "142.251.6.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 872, "name": "172.217.214.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 873, "name": "172.253.114.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 874, "name": "172.253.119.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 875, "name": "108.177.111.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 876, "name": "142.250.1.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 877, "name": "142.250.103.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 878, "name": "142.250.128.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 879, "name": "142.251.171.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, 
"timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 880, "name": "142.250.159.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 881, "name": "142.251.120.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 882, "name": "142.251.161.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 883, "name": "74.125.126.95:443" } } }, { "description": "Address resolution succeeded", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 205000000 }, "ChildRef": null }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690854135, "nanos": 618000000 }, "ChildRef": null }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690854271, "nanos": 307000000 }, "ChildRef": null }, { "description": "READY -\u003e IDLE", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 18000000 }, "ChildRef": null }, { "description": "IDLE -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": null }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 885, "name": "2607:f8b0:4001:c24::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 865, "name": "74.125.70.95:443" 
} } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 886, "name": "2607:f8b0:4001:c22::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 887, "name": "74.125.124.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 888, "name": "2607:f8b0:4001:c23::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 889, "name": "172.217.212.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 890, "name": "2607:f8b0:4001:c07::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 891, "name": "142.251.6.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 892, "name": "172.217.214.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 893, "name": "172.253.114.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 
1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 894, "name": "172.253.119.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 895, "name": "108.177.111.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 896, "name": "142.250.1.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 897, "name": "142.250.103.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 898, "name": "142.250.128.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 899, "name": "142.251.171.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 533000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 900, "name": "142.250.159.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 901, "name": "142.251.120.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 902, "name": "142.251.161.95:443" } } }, { 
"description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 903, "name": "74.125.126.95:443" } } }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 820000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 820000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 820000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 820000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING 
-\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 519000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 519000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 519000000 }, "ChildRef": null }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, 
"ChildRef": { "SubchannelRef": { "subchannel_id": 904, "name": "2607:f8b0:4001:c1c::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 905, "name": "209.85.145.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 906, "name": "2607:f8b0:4001:c0c::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 907, "name": "209.85.146.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 908, "name": "2607:f8b0:4001:c0e::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 618000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 909, "name": "209.85.147.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 910, "name": "2607:f8b0:4001:c1e::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 911, "name": "142.250.125.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 912, "name": "142.250.148.95:443" } } }, { 
"description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 913, "name": "209.85.234.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 914, "name": "142.251.172.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 915, "name": "142.250.152.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 916, "name": "108.177.112.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 887, "name": "74.125.124.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 889, "name": "172.217.212.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 891, "name": "142.251.6.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 892, "name": "172.217.214.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, 
"ChildRef": { "SubchannelRef": { "subchannel_id": 893, "name": "172.253.114.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 894, "name": "172.253.119.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 895, "name": "108.177.111.95:443" } } }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "Address resolution succeeded", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": 
null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857737, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857738, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857738, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857738, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857738, "nanos": 219000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857740, "nanos": 920000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857741, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857741, "nanos": 619000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857741, "nanos": 620000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, 
"nanos": 118000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 118000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 120000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 318000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 318000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 319000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 320000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 918000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857742, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857743, "nanos": 418000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857743, "nanos": 419000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857743, "nanos": 919000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857743, "nanos": 920000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, 
"timestamp": { "seconds": 1690857743, "nanos": 920000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857743, "nanos": 921000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690857746, "nanos": 918000000 }, "ChildRef": null }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 926, "name": "2607:f8b0:4001:c12::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 927, "name": "108.177.112.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 928, "name": "2607:f8b0:4001:c14::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 887, "name": "74.125.124.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 929, "name": "2607:f8b0:4001:c03::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 930, "name": "172.217.212.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 931, "name": 
"2607:f8b0:4001:c5a::5f:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 932, "name": "142.251.6.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 933, "name": "172.217.214.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 613000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 934, "name": "172.253.114.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 935, "name": "172.253.119.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 936, "name": "108.177.111.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 937, "name": "142.250.1.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 938, "name": "108.177.121.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 939, "name": "142.250.103.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { 
"seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 940, "name": "108.177.120.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 941, "name": "142.250.128.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 942, "name": "142.251.171.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 943, "name": "142.250.159.95:443" } } }, { "description": "Created subchannel or used existing subchannel", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": { "SubchannelRef": { "subchannel_id": 944, "name": "142.251.120.95:443" } } }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": null }, { "description": "Address resolution succeeded", "severity": 1, "timestamp": { "seconds": 1690857842, "nanos": 614000000 }, "ChildRef": null }, { "description": "READY -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690857943, "nanos": 576000000 }, "ChildRef": null } ] }, "calls_started": 80634, "calls_succeeded": 72393, "calls_failed": 8240, "last_call_started_timestamp": { "seconds": 1690899790, "nanos": 352000000 } }, "subchannel_ref": [ { "subchannel_id": 46, "name": "173.194.198.95:443" }, { "subchannel_id": 89, "name": "142.251.120.95:443" }, { "subchannel_id": 205, "name": "74.125.202.95:443" }, { "subchannel_id": 344, "name": "209.85.200.95:443" }, { "subchannel_id": 396, "name": "173.194.196.95:443" }, { "subchannel_id": 456, "name": "173.194.195.95:443" }, { 
"subchannel_id": 579, "name": "209.85.234.95:443" }, { "subchannel_id": 657, "name": "173.194.193.95:443" }, { "subchannel_id": 739, "name": "142.250.136.95:443" }, { "subchannel_id": 821, "name": "142.250.125.95:443" }, { "subchannel_id": 887, "name": "74.125.124.95:443" } ] } root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# grpcdebug localhost:5555 channelz subchannel 887 Subchannel ID: 887 Target: 74.125.124.95:443 State: READY Calls Started: 8242 Calls Succeeded: 0 Calls Failed: 8241 Created Time: 11 hours ago --- panic: Address type not supported for goroutine 1 [running]: github.com/grpc-ecosystem/grpcdebug/cmd.prettyAddress(0x14eb4e0?) /go/pkg/mod/github.com/grpc-ecosystem/grpcdebug@v1.0.5/cmd/channelz.go:40 +0xd9 github.com/grpc-ecosystem/grpcdebug/cmd.printSockets({0xc00011c140, 0x1, 0xc000111d48?}) /go/pkg/mod/github.com/grpc-ecosystem/grpcdebug@v1.0.5/cmd/channelz.go:74 +0x105 github.com/grpc-ecosystem/grpcdebug/cmd.channelzSubchannelCommandRunWithError(0x2044c20?, {0xc00036f540, 0x1?, 0x1?}) /go/pkg/mod/github.com/grpc-ecosystem/grpcdebug@v1.0.5/cmd/channelz.go:218 +0x60a github.com/spf13/cobra.(*Command).execute(0x2044c20, {0xc00036f450, 0x1, 0x1}) /go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:850 +0x67c github.com/spf13/cobra.(*Command).ExecuteC(0x2045940) /go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:958 +0x39d github.com/spf13/cobra.(*Command).Execute(...) 
/go/pkg/mod/github.com/spf13/cobra@v1.1.1/command.go:895 github.com/grpc-ecosystem/grpcdebug/cmd.Execute() /go/pkg/mod/github.com/grpc-ecosystem/grpcdebug@v1.0.5/cmd/root.go:107 +0xde main.main() /go/pkg/mod/github.com/grpc-ecosystem/grpcdebug@v1.0.5/main.go:167 +0x17 root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# grpcdebug localhost:5555 channelz subchannel 887 --json { "ref": { "subchannel_id": 887, "name": "74.125.124.95:443" }, "data": { "state": { "state": 3 }, "target": "74.125.124.95:443", "trace": { "num_events_logged": 3, "creation_timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "events": [ { "description": "Subchannel created", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 532000000 }, "ChildRef": null }, { "description": "IDLE -\u003e CONNECTING", "severity": 1, "timestamp": { "seconds": 1690857736, "nanos": 820000000 }, "ChildRef": null }, { "description": "CONNECTING -\u003e READY", "severity": 1, "timestamp": { "seconds": 1690857746, "nanos": 820000000 }, "ChildRef": null } ] }, "calls_started": 8248, "calls_failed": 8247, "last_call_started_timestamp": { "seconds": 1690900026, "nanos": 947000000 } }, "socket_ref": [ { "socket_id": 918, "name": "74.125.124.95:443" } ] } ```
That very last subchannel 887 output has a target of `74.125.124.95:443`, but the kernel doesn't show any outbound tcp connections at all. The 5555 listener is what I configured channelz to listen on.
tcp info ``` root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# cat /proc/7/net/tcp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode 0: 0100007F:15B3 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 3100672 1 0000000000000000 100 0 0 10 0 root@pubsub-test-deployment-7b74dbd4cf-k7vxt:/opt/code# ss -tap State Recv-Q Send-Q Local Address:Port Peer Address:Port Process LISTEN 0 511 127.0.0.1:5555 0.0.0.0:* users:(("node",pid=7,fd=22)) ``` Nothing is hitting eth0 so no packet captures.

Action starts here which is line 4,431,621 in the trace log file

I 2023-08-01T02:42:15.955Z | transport | (884) 74.125.70.95:443 connection closed by GOAWAY with code 0
I 2023-08-01T02:42:15.955Z | subchannel | (865) 74.125.70.95:443 READY -> IDLE

There are other examples earlier in this same trace log of a closed connection GOAWAY where grpc-js recovers just fine. Some lines I found interesting

I 2023-08-01T02:42:26.318Z | subchannel | (865) 74.125.70.95:443 CONNECTING -> READY
I 2023-08-01T02:42:26.820Z | subchannel | (887) 74.125.124.95:443 CONNECTING -> READY
I 2023-08-01T02:42:26.820Z | pick_first | Pick subchannel with address 74.125.124.95:443

...

I 2023-08-01T02:42:26.918Z | load_balancing_call | [289722] Pick result: COMPLETE subchannel: (887) 74.125.124.95:443 status: undefined undefined
I 2023-08-01T02:42:26.918Z | connectivity_state | (3) dns:pubsub.googleapis.com:443 CONNECTING -> READY
I 2023-08-01T02:42:28.319Z | transport_flowctrl | (918) 74.125.124.95:443 local window size: 65535 remote window size: 65535
I 2023-08-01T02:42:28.319Z | transport_internals | (918) 74.125.124.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false

subchannel (887) is what was reported in channelz. Those (918) lines have the same remote ip and are repeated in the rest of the trace log. I see a Subchannel constructed with options line for (887) but never for (918).

murgatroid99 commented 1 year ago

Thank you for all of this detailed information. It looks like subchannel 887 is definitely the culprit. Every single request initiated with it ends with DEADLINE_EXCEEDED, starting right after it became "READY", so it may never have been properly connected. Unfortunately, there's no clear indication in the logs of why that happened, so this is going to take more investigation.

Fortunately, keepalives should fix this for you. If there really is no connection, the client should see that on the first ping it sends and go back to connecting again. You should use the options I suggested in https://github.com/grpc/grpc-node/issues/2502#issuecomment-1648577842.
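As a hedged illustration of those suggestions (the option names are real grpc-js channel arguments, but the values here are examples, not necessarily the exact ones from the linked comment), keepalive can be configured through channel options like this:

```javascript
// Sketch: grpc-js keepalive channel options. With @grpc/grpc-js
// installed, this object would be passed as the third argument to a
// Client (or Channel) constructor. Values below are illustrative.
const keepaliveOptions = {
  // Send a keepalive ping every 10 seconds.
  'grpc.keepalive_time_ms': 10000,
  // Treat the connection as dead if a ping goes unacknowledged for 5 seconds.
  'grpc.keepalive_timeout_ms': 5000,
  // Allow pings even when there are no outstanding calls.
  'grpc.keepalive_permit_without_calls': 1,
};

// Hypothetical usage (MyServiceClient is a placeholder for a generated stub):
// const client = new MyServiceClient(address, credentials, keepaliveOptions);
console.log(Object.keys(keepaliveOptions).join(', '));
```

With these options, a dead connection should be detected within roughly `keepalive_time_ms + keepalive_timeout_ms` of going bad, at which point the channel discards it and reconnects.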

I see a Subchannel constructed with options line for (887) but never for (918).

That is correct. 918 is the ID for a transport, not a subchannel. That corresponds to a "socket" in channelz, and you can see that ID in the socket_ref in the channelz information for subchannel 887.

A few notes about the logging output, mostly for things I need to change:

  1. Some of the transport logs say [object Object] instead of the ID number. That has been fixed, and the fix will go out in 1.9.0.
  2. The channelz channel info appears to have some stale references to previously-connected subchannels. I believe this is also fixed in 1.9.0.
  3. There is something about what grpcdebug expects in a subchannel's address that doesn't match what this library is outputting.
  4. The > character doesn't render well in grpcdebug output, so it might be good to change the format of those logs.
  5. There is a lot of CONNECTING -> CONNECTING spam in channelz logs.
wwilfinger commented 1 year ago

Thanks so much for the detailed response. I'll turn on keepalives and see what happens.

That is correct. 918 is the ID for a transport, not a subchannel. That corresponds to a "socket" in channelz, and you can see that ID in the socket_ref in the channelz information for subchannel 887.

Ah okay, that makes sense!

murgatroid99 commented 1 year ago

Here's something interesting: the command grpcdebug localhost:5555 channelz subchannel 887 failed because it was trying to pretty-print the socket's local and remote addresses, and it fails because one of those values is nil (null in the protobuf message). With a normal working connection, both of those should have a non-null value.

wwilfinger commented 1 year ago

(I think the keepalives will help our production deployment. I haven't seen it again in production but we were only seeing it about once a month or so anyway. I'm still interested in what's going on. I also know there's a lot of existing code not configuring keepalives.)

My test setup (running in GKE standard [not autopilot] cluster, us-central1, e2-medium node) reproduced twice overnight. I was able to get heapdumps, grpc traces, and packet captures for both reproductions.

I can email you the packet captures with the tls keylog and the heapdumps if requested. I believe the bearer token for the service account has expired by now, but it's all over both of those, so I don't really want to post it all on the internet forever.

Trace logs, grpcdebug output, and some shell cmd output are available here. "26q77" and "rcmzn" were suffixes on the k8s pods that reproduced. This all looks similar to my previous reproduction, but I have packet captures.

26q77 trace.log line 2,361,090 ``` I 2023-08-08T08:29:13.506Z | transport | (449) 172.217.214.95:443 connection closed by GOAWAY with code 0 I 2023-08-08T08:29:13.506Z | subchannel | (430) 172.217.214.95:443 READY -> IDLE I 2023-08-08T08:29:13.507Z | subchannel_refcount | (430) 172.217.214.95:443 refcount 2 -> 1 I 2023-08-08T08:29:13.507Z | pick_first | READY -> IDLE ... I 2023-08-08T08:29:26.006Z | load_balancing_call | [154308] Pick called I 2023-08-08T08:29:26.006Z | load_balancing_call | [154308] Pick result: COMPLETE subchannel: (477) 142.251.120.95:443 status: undefined undefined I 2023-08-08T08:29:26.006Z | connectivity_state | (3) dns:pubsub.googleapis.com:443 CONNECTING -> READY I 2023-08-08T08:29:26.105Z | transport_flowctrl | (506) 142.251.120.95:443 local window size: 65535 remote window size: 65535 I 2023-08-08T08:29:26.105Z | transport_internals | (506) 142.251.120.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false ``` This is the packet capture for the connection to 142.251.120.95:443 ![26q77](https://github.com/grpc/grpc-node/assets/11001826/d3f4635a-aac5-4b36-9096-27cf16c8708a)

Similar story for the other pod

rcmzn trace.log line 4,378,056 ``` I 2023-08-08T13:36:56.006Z | transport | (869) 172.253.114.95:443 connection closed by GOAWAY with code 0 I 2023-08-08T13:36:56.006Z | subchannel | (850) 172.253.114.95:443 READY -> IDLE I 2023-08-08T13:36:56.006Z | subchannel_refcount | (850) 172.253.114.95:443 refcount 2 -> 1 I 2023-08-08T13:36:56.006Z | pick_first | READY -> IDLE ... I 2023-08-08T13:37:21.426Z | load_balancing_call | [286257] Pick called I 2023-08-08T13:37:21.426Z | load_balancing_call | [286257] Pick result: COMPLETE subchannel: (890) 173.194.195.95:443 status: undefined undefined I 2023-08-08T13:37:21.426Z | retrying_call | [286256] startRead called I 2023-08-08T13:37:21.426Z | load_balancing_call | [286257] startRead called I 2023-08-08T13:37:21.505Z | transport_flowctrl | (925) 173.194.195.95:443 local window size: 65535 remote window size: 65459 I 2023-08-08T13:37:21.505Z | transport_internals | (925) 173.194.195.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false ``` This is the packet capture for the connection to 173.194.195.95:443 ![rcmzn](https://github.com/grpc/grpc-node/assets/11001826/8c20c367-abc3-4050-aa54-5703555c0380)

I'm going to type a few words of analysis but only worry about the "26q77" logs. The "rcmzn" case is extremely similar.

T08:29:14.606 client->server TCP starts the connection (SYN)
T08:29:15.805 client->server TLSv1.3 Client Hello
T08:29:15.806 server->client TLSv1.3 Server Hello
T08:29:15.806 client->server ACKs the Server Hello

Nothing happens on the network for several seconds

T08:29:24.622 server->client FIN,ACK. The server gives up. The client ACKs this

The server seems to have a 10 second timeout on establishing TLS

T08:29:26.006Z | load_balancing_call | [154308] Pick result: COMPLETE subchannel: (477) 142.251.120.95:443 status: undefined undefined

T08:29:26.105 client->server TLSv1.3 Change Cipher Spec, Finished

Client is finally done with the TLS handshake but it's too late!

On the same millisecond as above:

T08:29:26.105Z | transport_flowctrl | (506) 142.251.120.95:443 local window size: 65535 remote window size: 65535
T08:29:26.105Z | transport_internals | (506) 142.251.120.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false

T08:29:26.106 server->client RST. The server sends reset. The server already sent FIN,ACK. Get outta here

I don't have proof, but I have a feeling that node isn't being given enough cpu cycles to complete the handshake in time. There are noticeable gaps in the trace logs around T08:29:19.808 - T08:29:21.705. The resets from the server side happen to a lot (all?) of the inflight TCP connections.

another wireshark screenshot ![lots-of-resets](https://github.com/grpc/grpc-node/assets/11001826/f3c89dcb-4943-4972-85f6-3286b490e486)

I'm running my testing code with guaranteed QoS 20 mcpu (request=limit) but on a e2-medium which is a shared cpu instance. We run 2000 mcpu with guaranteed QoS in production on much larger vms, but the app is doing much more than pubsub publishes. I found rumors of node's tls handshake happening in the main thread.

Where I've not reproduced this behavior with my test script: running on a local pc, running in GKE autopilot, running in GKE standard with cpu request=20m and no limit. All of those would have fed more cpu cycles to node.

Node continuing to think session.closed=false session.destroyed=false session.socket.destroyed=false for the http2 session after the server sent a RST packet and the kernel has stopped tracking the TCP connection definitely seems incorrect. I gave it a good 15 minutes looking for nodejs issues that seemed similar but didn't come up with anything. I'm running the test code on node 18.16.1.

I hardly know what I'm doing with heapdumps. It seems like most of the memory is in buffers for outgoing requests that are never going to happen.

heapdump screenshot ![heapdump](https://github.com/grpc/grpc-node/assets/11001826/57565f36-fb24-4679-a1d7-874e518fb92f)
wwilfinger commented 1 year ago

https://github.com/wwilfinger/grpc-pubsub-fake-public/

I can reproduce locally consistently with this code. I had most of this already written so it's not a very minimal reproduction. The Golang gRPC server in this has a 1ms connection timeout configured. The NodeJS client can never (that I've seen) complete TLS within that time. After a handful of tries the client gets into the same state that I've seen while running in GKE and communicating with actual PubSub.

In the client code, uncomment the gRPC keepalive config and it will get itself out of the "stuck" state (good!).

Increase the server-side connection timeout to 1 second, everything works okay.

murgatroid99 commented 1 year ago

This is great data. I think these packet captures are the smoking gun we were looking for: the client misses the FIN,ACK packet and then misses or ignores the RST, and from then on treats the connection as still open and usable, even though it is not. The heap dump seems to show where the memory leak is coming from: the HTTP/2 implementation will buffer arbitrarily many outgoing packets if it thinks it has a writable socket.

I think this clearly shows that this is a Node bug, so I think you should file an issue in the Node repository. Do your packet captures show any further socket activity after the RST packet? The answer to that would indicate how the client is interacting with the socket in the long term.

Keepalives should always get a client stuck like this unstuck. The client will send the ping, and it won't get a response, and then gRPC will discard the connection and create a new one.

murgatroid99 commented 1 year ago

I managed to strace the client side of the reproduction with all gRPC tracers active, and the output is interesting, and might help with a Node issue. Here are the most relevant lines (not all consecutive):

socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 23
setsockopt(23, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(23, {sa_family=AF_INET6, sin6_port=htons(50051), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
write(23, "\26\3\1\1\177\1\0\1{\3\00316\23\230K\202\rS\215Y7y\24a\341\275\376\235/.\207"..., 388) = 388
epoll_wait(13, [{events=EPOLLIN, data={u32=23, u64=23}}], 1024, 1) = 1
read(23, "\26\3\3\0z\2\0\0v\3\3\346\325[\341\6\17\3353\225\245 $6\220\323\302\257\256\20\26\214"..., 65536) = 2332
D 2023-08-09T21:49:17.512Z | subchannel | (5) ::1:50051 CONNECTING -> READY
D 2023-08-09T21:49:17.512Z | pick_first | Pick subchannel with address ::1:50051
write(23, "\24\3\3\0\1\1\27\3\3\0005r\10\0\301=\346.\37\237\264\344\325\357\2025=o\353~R\230"..., 64) = 64
write(23, "\27\3\3\1\177\223[5\223\221\354\206\252\303:\330\231c\312{\243~\33\235\336~\r\371\360\322hx"..., 388) = -1 EPIPE (Broken pipe)
epoll_wait(13, [{events=EPOLLIN|EPOLLHUP, data={u32=23, u64=23}}, {events=EPOLLIN|EPOLLERR|EPOLLHUP, data={u32=24, u64=24}}, {events=EPOLLIN, data={u32=21, u64=21}}], 1024, 0) = 3
epoll_ctl(13, EPOLL_CTL_DEL, 23, 0x7ffc71511b70) = 0

So, FD 23 corresponds to the socket that appears to connect. Then writing to it results in EPIPE, and then the next epoll_wait call tries to watch that FD anyway. Immediately after that, an epoll_ctl call deletes FD 23 from the poll set. After that there are no more references to FD 23 in the strace output that I could see. So, it looks like something knows that that FD is unusable, but that information doesn't propagate up to the parts of the Node API that we see.

murgatroid99 commented 1 year ago

Based on the information in https://github.com/nodejs/node/issues/49147#issuecomment-1679515331, I published a change in grpc-js version 1.9.1 that defers all actions in the write callback using process.nextTick. Please try it out, to see if it improves anything.
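As a simplified illustration of that change (this is not the actual grpc-js internals, just the deferral pattern), moving a write callback's follow-up work into process.nextTick lets the current synchronous call stack fully unwind before any further stream state is touched:

```javascript
// Sketch of the grpc-js 1.9.1 pattern: defer actions taken in a write
// callback with process.nextTick. The write() stand-in below mimics a
// socket.write() that invokes its callback synchronously, which can
// happen on some error paths.
const order = [];

function write(data, callback) {
  order.push('write');
  callback(); // invoked synchronously, like some socket error paths
}

// Acting directly in the callback runs before the caller's stack unwinds.
write('payload A', () => order.push('direct callback'));

// Deferring with process.nextTick runs after all synchronous code.
write('payload B', () => process.nextTick(() => order.push('deferred callback')));

order.push('end of synchronous code');

process.nextTick(() => {
  console.log(order.join(' | '));
  // → write | direct callback | write | end of synchronous code | deferred callback
});
```

The deferred callback is the last entry: it runs only after the synchronous stack has unwound, which is the behavior the 1.9.1 change relies on.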

P0rth0s commented 1 year ago

v1.9.1 has made this worse for me. With this version I am seeing this bug at much greater frequency than previously (I believe the oldest I've seen it on is 1.8.2, but I could be wrong).

I may be able to use this increased frequency to get around some of the packet capture issues I was running into.

murgatroid99 commented 1 year ago

@P0rth0s have you tried enabling keepalives, as suggested in https://github.com/grpc/grpc-node/issues/2502#issuecomment-1648577842? The argument to the PubSub constructor can also be passed as the third argument to the gRPC Client or Channel constructor for the same effect.

P0rth0s commented 1 year ago

@murgatroid99 Yes passing "grpc.keepalive_time_ms": 10000.

I am seeing this under very similar conditions to the ones linked in that other thread. Since I can get this within a day consistently on 1.9.1 it shouldn't be a problem for me to get a tcpdump without ssl enabled

murgatroid99 commented 1 year ago

The comment I linked is from earlier in this thread. Keepalives really should be fixing this. Remember that with those settings, a connection can stay in the bad state for up to 30 seconds before it is detected by keepalives. Can you confirm that you are seeing the problem persist for longer than that? Can you also confirm that keepalives are actually enabled by running your code with the environment variables GRPC_TRACE=keepalive and GRPC_VERBOSITY=DEBUG?

krrose27 commented 11 months ago

Hi @murgatroid99, I've picked up playing with this from @P0rth0s.

We've upgraded to 1.9.4 and have been doing some debugging and testing. I'm not 100% sure we are seeing the same issue or a newer behavior, possibly due to changes we've tried to make. Specifically, it now appears that we churn through many connections very rapidly, which in turn drives up CPU load even when we aren't making that many calls.

I've uploaded GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG to a gist.

murgatroid99 commented 11 months ago

It looks like you sorted that log. If so, that's more confusing than helpful, because it puts all of the events with the same timestamp in alphabetical order. Can you please share the unsorted output?

krrose27 commented 11 months ago

Sorry about that; sorting is standard practice for most of our internal logs. https://gist.github.com/krrose27/b9e31023bccfbcda02fb828c5f6317d7

murgatroid99 commented 11 months ago

This is definitely a different problem than the one we were seeing earlier. It looks like for some reason the server is repeatedly accepting the connection and then closing it. The client doesn't back off when connection attempts succeed, so this is a bit of a pathological case where the client will just keep trying to connect. If the server instead refused the connections entirely, the client would start connecting with a default 1 second delay and back off exponentially to a default maximum of 2 minutes.

murgatroid99 commented 11 months ago

If you are running a Node server, and you call bindAsync but not start, it could result in that behavior. I suggest checking if that is the case, and if so, changing it.

Update: As of grpc-js 1.10.x, start is no longer required to start a server. bindAsync is sufficient.