P0rth0s opened this issue 1 year ago
Did you have all tracers enabled here, or only some of them? You said that this persists on retries; does the server log that it receives multiple requests when those retries occur? Do you have keepalives enabled on the client? Can you trace this interaction with Wireshark or tcpdump and share the dump log?
All tracers should be enabled, I have GRPC_TRACE="all" and GRPC_VERBOSITY="DEBUG".
Yes, retrying results in another set of approximately the same logs on both the client and server
I am not passing keepAlive. I will add it.
Will look into if it is possible for me to get a Wireshark trace.
> Yes, retrying results in another set of approximately the same logs on both the client and server
OK, that probably means that the connection is still valid, which means that keepalives probably won't change anything.
Environment was accidentally removed from this bad state before I could capture tcpdump. Will get tcpdump asap when I repro again, might be a couple days.
I may be seeing this same issue, but it might be different. Let me know and I can open a separate issue if needed.
All these reproductions were using @grpc/grpc-js@1.8.17, @google-cloud/pubsub@3.2.1, and google-gax@3.6.0. We've seen the issue in the wild for at least the past six months or so, going off my own memory.
We run a workload in GKE that serves REST requests and also publishes to PubSub. Very occasionally, about once a month, a single pod will go off the rails and continually log this DEADLINE_EXCEEDED error from gax-nodejs. PubSub messages are never published, the pod never recovers, and we need to kill the pod.
I wrote a small script that, in a loop, awaits a publish to PubSub and waits 500ms. I added error logging to gax-nodejs around here. I was able to reproduce. Every error was this stack trace from grpc-node v1.8.17.
I found this example of adding channelz support (thank you!). Added that, reproduced again.
I'm not sure what would trigger what I think is the resolving_load_balancer to re-resolve, but it looks like that's what happened in this case.
I'm running this on a zonal GKE cluster in us-central1-a and it will reproduce on at least one of ten pods within 12 hours. I've not been able to reproduce on any of: Local PC running on my consumer Internet line, GKE Autopilot in us-central1, regional GKE cluster in us-west1.
Please let me know what I can collect that would be most useful for you to debug. I don't have tracing turned on because I didn't want to give myself a big GCP logging bill, but I might be able to write the tracing to a file and get it that way.
The DEADLINE_EXCEEDED error you linked from google-gax has the error text Total timeout of API ${apiName} exceeded ${retry.backoffSettings.totalTimeoutMillis} milliseconds before any response was received., but the stack trace you shared has the error text Deadline exceeded. It should only be one or the other, so can you clarify what you are seeing there?
The channelz error you are seeing may indicate a channelz bug in the client. That component is not well-tested, so I wouldn't be surprised. Can you double check that you cannot get info on any of the listed subchannels?
I'll look into this more on Monday.
> can you clarify what you are seeing there?
I'm seeing both. Sorry about being unclear.
With no changes to gax-nodejs, I would only ever see the "Total timeout of API ${apiName} exceeded" error from here in gax.
Gax does not log, or provide the caller a way to log, an error from grpc-js that gax is going to retry. That would be done in gax around this line here. There are seven retry codes configured for google.pubsub.v1.Publisher Publish, set here. That means I don't know which of the seven error codes is actually happening, so I have no real idea of what's going on.
I added logging into gax (at that line) to log the error message gax is receiving from grpc-js and to also log some of the variables in scope at that point.
Here's a screenshot of the logging with my patch I had saved from earlier.
These are the retry settings I'm using.
initialRpcTimeoutMillis=5000 and totalTimeoutMillis=60000 means gax is retrying roughly every 5000ms. The log lines with the green details are the DEADLINE_EXCEEDED errors from grpc-js here. After 12 tries at 5000ms, over 60,000ms has passed, which exceeds totalTimeoutMillis, so gax throws this error here.
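As a rough model of that arithmetic (my own toy sketch, not gax's actual retry code), assuming each attempt burns its full per-attempt timeout before failing:

```javascript
// Toy model (not gax's real implementation): each attempt takes
// ~rpcTimeoutMillis to fail with DEADLINE_EXCEEDED, and gax gives up once
// the accumulated time reaches totalTimeoutMillis.
function attemptsBeforeTotalTimeout(rpcTimeoutMillis, totalTimeoutMillis) {
  let elapsed = 0;
  let attempts = 0;
  while (elapsed < totalTimeoutMillis) {
    elapsed += rpcTimeoutMillis; // each retry runs until its deadline expires
    attempts += 1;
  }
  return attempts;
}

console.log(attemptsBeforeTotalTimeout(5000, 60000)); // prints 12
```

With the settings above this gives the 12 attempts observed in the logs.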
Once a pod gets into this state it never recovers. Very similar to P0rth0s reporting "The error persists on retries until the server or client is restarted" which is why I added to this issue instead of creating a new one.
Thanks for the eyes on it!
Please try updating your dependencies so that you pick up @grpc/grpc-js version 1.8.19, and then enable keepalives. As far as I understand, you can do this with pubsub by constructing the instance like this:
const pubsubClient = new PubSub({ 'grpc.keepalive_timeout_ms': 10000, 'grpc.keepalive_time_ms': 30000 } as any);
The as any is only needed if you are using TypeScript. If you are already passing other options to that constructor, these options can simply be added to the existing options object. The specific numbers there are suggested values; you can change them if necessary.
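For concreteness, a hedged sketch of merging these channel options into an existing options object (the projectId value is a placeholder, not from this thread):

```javascript
// Sketch: adding the suggested keepalive channel options alongside existing
// PubSub constructor options. 'my-project' is a placeholder.
const { PubSub } = require('@google-cloud/pubsub');

const pubsubClient = new PubSub({
  projectId: 'my-project',            // any options you already pass stay as-is
  'grpc.keepalive_time_ms': 30000,    // send a ping every 30s while idle
  'grpc.keepalive_timeout_ms': 10000, // drop the connection if no ping ack within 10s
});
```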
If that doesn't help, we can look into investigating further with trace logs.
@murgatroid99 I sent you an email containing packet captures.
Note: I was originally getting DEADLINE_EXCEEDED errors but then started getting CANCELLED errors instead. However, all other behavior appears consistent.
FWIW, I was encountering this using the @opentelemetry/exporter-trace-otlp-grpc package, where all of a sudden every emission of trace data was met with the DEADLINE_EXCEEDED error. There was also a memory leak when a service instance got into this state. This was on version 1.8.18, but it was also happening as far back as 1.4.1, although it was less frequent with that version. To stabilize our services we moved to using the HTTP exporter instead of the gRPC one, but I'm bringing it up in case it adds some investigatory context when looking at how @opentelemetry/exporter-trace-otlp-grpc has implemented connection handling and/or the options it passes to @grpc/grpc-js.
@oldmantaiter From your description of the error, I think you would benefit from enabling keepalives, but unfortunately it looks like there isn't a way to inject options to the relevant gRPC client constructor call here. You may want to open an issue with that repository to get them to either set those options or allow you to set options.
@murgatroid99 - Yep, that's where I ended up and switched libraries to resolve our issue, I'll open an issue to ask them to allow setting GRPC options so others running into it might be able to enable keepalives in the future.
After a few goof-ups, I was able to reproduce and grab trace logs. This was with @grpc/grpc-js@1.8.19. The only patches I made here were to write trace logs to a file and to set TARGET_RETAINED_TRACES to 128.
I have a tarball of the trace logs available here: https://github.com/wwilfinger/grpc-node-deadline-exceeded/tree/main/trace-logs
Action starts here which is line 4,431,621 in the trace log file
I 2023-08-01T02:42:15.955Z | transport | (884) 74.125.70.95:443 connection closed by GOAWAY with code 0
I 2023-08-01T02:42:15.955Z | subchannel | (865) 74.125.70.95:443 READY -> IDLE
There are other examples earlier in this same trace log of a connection closed by GOAWAY where grpc-js recovers just fine. Some lines I found interesting:
I 2023-08-01T02:42:26.318Z | subchannel | (865) 74.125.70.95:443 CONNECTING -> READY
I 2023-08-01T02:42:26.820Z | subchannel | (887) 74.125.124.95:443 CONNECTING -> READY
I 2023-08-01T02:42:26.820Z | pick_first | Pick subchannel with address 74.125.124.95:443
...
I 2023-08-01T02:42:26.918Z | load_balancing_call | [289722] Pick result: COMPLETE subchannel: (887) 74.125.124.95:443 status: undefined undefined
I 2023-08-01T02:42:26.918Z | connectivity_state | (3) dns:pubsub.googleapis.com:443 CONNECTING -> READY
I 2023-08-01T02:42:28.319Z | transport_flowctrl | (918) 74.125.124.95:443 local window size: 65535 remote window size: 65535
I 2023-08-01T02:42:28.319Z | transport_internals | (918) 74.125.124.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false
Subchannel (887) is what was reported in channelz. Those (918) lines have the same remote IP and are repeated in the rest of the trace log. I see a Subchannel constructed with options line for (887) but never for (918).
Thank you for all of this detailed information. It looks like subchannel 887 is definitely the culprit. Every single request initiated with it ends with DEADLINE_EXCEEDED, starting right after it became "READY", so it may never have been properly connected. Unfortunately, there's no clear indication in the logs of why that happened, so this is going to take more investigation.
Fortunately, keepalives should fix this for you. If there really is no connection, the client should see that on the first ping it sends and go back to connecting again. You should use the options I suggested in https://github.com/grpc/grpc-node/issues/2502#issuecomment-1648577842.
> I see a Subchannel constructed with options line for (887) but never for (918).
That is correct. 918 is the ID for a transport, not a subchannel. That corresponds to a "socket" in channelz, and you can see that ID in the socket_ref in the channelz information for subchannel 887.
A few notes about the logging output, mostly for things I need to change:

- transport logs say [object Object] instead of the ID number. That has been fixed, and the fix will go out in 1.9.0.
- grpcdebug expects something in a subchannel's address that doesn't match what this library is outputting.
- The > character doesn't render well in grpcdebug output, so it might be good to change the format of those logs.
- CONNECTING -> CONNECTING spam in channelz logs.

Thanks so much for the detailed response. I'll turn on keepalives and see what happens.
> That is correct. 918 is the ID for a transport, not a subchannel. That corresponds to a "socket" in channelz, and you can see that ID in the socket_ref in the channelz information for subchannel 887.
Ah okay, that makes sense!
Here's something interesting: the command grpcdebug localhost:5555 channelz subchannel 887 failed because it was trying to pretty-print the socket's local and remote addresses, and one of those values is nil (null in the protobuf message). With a normal working connection, both of those should have non-null values.
(I think the keepalives will help our production deployment. I haven't seen it again in production but we were only seeing it about once a month or so anyway. I'm still interested in what's going on. I also know there's a lot of existing code not configuring keepalives.)
My test setup (running in GKE standard [not autopilot] cluster, us-central1, e2-medium node) reproduced twice overnight. I was able to get heapdumps, grpc traces, and packet captures for both reproductions.
I can email you the packet captures with the TLS keylog and the heapdumps if requested. I believe the bearer token for the service account has expired by now, but it's all over both of those and I don't really want to post it all on the internet forever.
Trace logs, grpcdebug output, and some shell cmd output are available here. "26q77" and "rcmzn" were suffixes on the k8s pods that reproduced. This all looks similar to my previous reproduction, but I have packet captures.
Similar story for the other pod
I'm going to type a few words of analysis but only worry about the "26q77" logs. The "rcmzn" case is extremely similar.
T08:29:14.606 client->server TCP starts the connection (SYN)
T08:29:15.805 client->server TLSv1.3 Client Hello
T08:29:15.806 server->client TLSv1.3 Server Hello
T08:29:15.806 client->server ACKs the Server Hello
Nothing happens on the network for several seconds
T08:29:24.622 server->client FIN,ACK. The server gives up. The client ACKs this
The server seems to have a 10 second timeout on establishing TLS
T08:29:26.006Z | load_balancing_call | [154308] Pick result: COMPLETE subchannel: (477) 142.251.120.95:443 status: undefined undefined
T08:29:26.105 client->server TLSv1.3 Change Cipher Spec, Finished
Client is finally done with the TLS handshake but it's too late!
On the same millisecond as above:
T08:29:26.105Z | transport_flowctrl | (506) 142.251.120.95:443 local window size: 65535 remote window size: 65535
T08:29:26.105Z | transport_internals | (506) 142.251.120.95:443 session.closed=false session.destroyed=false session.socket.destroyed=false
T08:29:26.106 server->client RST. The server sends reset. The server already sent FIN,ACK. Get outta here
I don't have proof, but I have a feeling that node isn't being given enough CPU cycles to complete the handshake in time. There are noticeable gaps in the trace logs around T08:29:19.808 - T08:29:21.705. The resets from the server side happen to a lot (all?) of the in-flight TCP connections.
I'm running my testing code with guaranteed QoS at 20 mcpu (request=limit) but on an e2-medium, which is a shared-cpu instance. We run 2000 mcpu with guaranteed QoS in production on much larger VMs, but that app is doing much more than PubSub publishes. I found rumors of node's TLS handshake happening on the main thread.
Where I've not reproduced this behavior with my test script: running on a local PC, running in GKE Autopilot, and running in GKE standard with cpu request=20m and no limit. All of those would have fed more CPU cycles to node.
Node continuing to think session.closed=false session.destroyed=false session.socket.destroyed=false for the http2 session, after the server sent a RST packet and the kernel has stopped tracking the TCP connection, definitely seems incorrect. I spent a good 15 minutes looking for nodejs issues that seemed similar but didn't come up with anything. I'm running the test code on node 18.16.1.
I hardly know what I'm doing with heapdumps. It seems like most of the memory is in buffers for outgoing requests that are never going to happen.
https://github.com/wwilfinger/grpc-pubsub-fake-public/
I can reproduce locally and consistently with this code. I had most of this already written, so it's not a very minimal reproduction. The Golang gRPC server in it has a 1ms connection timeout configured. The NodeJS client can never (that I've seen) complete TLS within that time. After a handful of tries, the client gets into the same state I've seen while running in GKE and communicating with actual PubSub.
In the client code, uncomment the gRPC keepalive config and the client will get itself out of the "stuck" state (good!).
Increase the server-side connection timeout to 1 second and everything works okay.
This is great data. I think these packet captures are the smoking gun we were looking for: the client misses the FIN,ACK packet and then misses or ignores the RST, and from then on treats the connection as still open and usable, even though it is not. The heap dump seems to show where the memory leak is coming from: the HTTP/2 implementation will buffer arbitrarily many outgoing packets if it thinks it has a writable socket.
I think this clearly shows that this is a Node bug, so I think you should file an issue in the Node repository. Do your packet captures show any further socket activity after the RST packet? The answer to that would indicate how the client is interacting with the socket in the long term.
Keepalives should always get a client stuck like this unstuck. The client will send the ping, and it won't get a response, and then gRPC will discard the connection and create a new one.
I managed to strace the client side of the reproduction with all gRPC tracers active, and the output is interesting, and might help with a Node issue. Here are the most relevant lines (not all consecutive):
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 23
setsockopt(23, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(23, {sa_family=AF_INET6, sin6_port=htons(50051), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
write(23, "\26\3\1\1\177\1\0\1{\3\00316\23\230K\202\rS\215Y7y\24a\341\275\376\235/.\207"..., 388) = 388
epoll_wait(13, [{events=EPOLLIN, data={u32=23, u64=23}}], 1024, 1) = 1
read(23, "\26\3\3\0z\2\0\0v\3\3\346\325[\341\6\17\3353\225\245 $6\220\323\302\257\256\20\26\214"..., 65536) = 2332
D 2023-08-09T21:49:17.512Z | subchannel | (5) ::1:50051 CONNECTING -> READY
D 2023-08-09T21:49:17.512Z | pick_first | Pick subchannel with address ::1:50051
write(23, "\24\3\3\0\1\1\27\3\3\0005r\10\0\301=\346.\37\237\264\344\325\357\2025=o\353~R\230"..., 64) = 64
write(23, "\27\3\3\1\177\223[5\223\221\354\206\252\303:\330\231c\312{\243~\33\235\336~\r\371\360\322hx"..., 388) = -1 EPIPE (Broken pipe)
epoll_wait(13, [{events=EPOLLIN|EPOLLHUP, data={u32=23, u64=23}}, {events=EPOLLIN|EPOLLERR|EPOLLHUP, data={u32=24, u64=24}}, {events=EPOLLIN, data={u32=21, u64=21}}], 1024, 0) = 3
epoll_ctl(13, EPOLL_CTL_DEL, 23, 0x7ffc71511b70) = 0
So, FD 23 corresponds to the socket that appears to connect. Then writing to it results in EPIPE, and then the next epoll_wait call tries to watch that FD anyway. Immediately after that, an epoll_ctl call deletes FD 23 from the poll set. After that, there are no more references to FD 23 in the strace output that I could see. So, it looks like something knows that that FD is unusable, but that information doesn't propagate up to the parts of the Node API that we see.
Based on the information in https://github.com/nodejs/node/issues/49147#issuecomment-1679515331, I published a change in grpc-js version 1.9.1 that defers all actions in the write callback using process.nextTick. Please try it out to see if it improves anything.
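As a minimal illustration of the ordering this change buys (my own toy, not the grpc-js code): work queued with process.nextTick runs only after the synchronous code already on the stack, so socket-state updates delivered synchronously can be observed before the write result is acted on.

```javascript
// Demonstrates the deferral pattern: the write-callback work is pushed onto
// the nextTick queue, so synchronous event delivery happens first.
const order = [];

// Stand-in for grpc-js's write callback after the 1.9.1 change.
function onWriteComplete() {
  process.nextTick(() => order.push('handle-write-result'));
}

onWriteComplete();
// Synchronous work still on the stack (e.g. delivering a socket error)
// runs before anything queued with process.nextTick.
order.push('deliver-socket-error');

process.nextTick(() => {
  console.log(order.join(' -> ')); // prints deliver-socket-error -> handle-write-result
});
```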
v1.9.1 has made this worse for me. With this version I am seeing this bug at a much greater frequency than previously (I believe the oldest I've seen it on is 1.8.2, but I could be wrong).
I may be able to use this increased frequency to get around some of the packet capture issues I was running into.
@P0rth0s have you tried enabling keepalives, as suggested in https://github.com/grpc/grpc-node/issues/2502#issuecomment-1648577842? The argument to the PubSub constructor can also be passed as the third argument to the gRPC Client or Channel constructor for the same effect.
@murgatroid99 Yes, passing "grpc.keepalive_time_ms": 10000.
I am seeing this under very similar conditions to the ones linked in that other thread. Since I can hit this within a day consistently on 1.9.1, it shouldn't be a problem for me to get a tcpdump without SSL enabled.
The comment I linked is from earlier in this thread. Keepalives really should be fixing this. Remember that with those settings, a connection can stay in the bad state for up to 30 seconds before it is detected by keepalives. Can you confirm that you are seeing the problem persist for longer than that? Can you also confirm that keepalives are actually enabled by running your code with the environment variables GRPC_TRACE=keepalive and GRPC_VERBOSITY=DEBUG?
Hi @murgatroid99, I've picked up playing with this from @P0rth0s.
We've upgraded to 1.9.4 and have been doing some debugging and testing. I'm not 100% sure we are seeing the same issue or a newer behavior, possibly due to changes we've tried to make. Specifically, it now appears that we spin up many connections very rapidly, which in turn spikes CPU load even when we aren't making that many calls.
I've uploaded GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG output to a gist.
It looks like you sorted that log. If so, that's more confusing than helpful, because it puts all of the events with the same timestamp in alphabetical order. Can you please share the unsorted output?
Sorry about that; sorting is standard practice for most of our internal logs. https://gist.github.com/krrose27/b9e31023bccfbcda02fb828c5f6317d7
This is definitely a different problem than the one we were seeing earlier. It looks like for some reason the server is repeatedly accepting the connection and then closing it. The client doesn't back off when connection attempts succeed, so this is a bit of a pathological case where the client will just keep trying to connect. If the server instead refused the connections entirely, the client would start connecting with a default 1 second delay and back off exponentially to a default maximum of 2 minutes.
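Those defaults come from the gRPC connection backoff spec (1s initial delay, 1.6x multiplier, 120s cap); a small sketch of the resulting delay schedule, with the spec's jitter omitted for clarity:

```javascript
// Reconnect delay schedule per the gRPC connection backoff spec defaults
// (jitter omitted): 1s initial, 1.6x multiplier, capped at 120s.
function backoffDelaysMillis(attempts, initial = 1000, multiplier = 1.6, max = 120000) {
  const delays = [];
  let delay = initial;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.round(delay));
    delay = Math.min(delay * multiplier, max); // grow until the cap
  }
  return delays;
}

console.log(backoffDelaysMillis(5).join(', ')); // 1000, 1600, 2560, 4096, 6554
```

This backoff only kicks in when connection attempts fail, which is why a server that accepts and then closes connections sidesteps it entirely.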
If you are running a Node server and you call bindAsync but not start, it could result in that behavior. I suggest checking whether that is the case, and if so, changing it.
Update: As of grpc-js 1.10.x, start is no longer required to start a server; bindAsync is sufficient.
Problem description
Intermittently, our grpc client is entering a state where the server is sending a response to the client, but the client doesn't receive it and throws a DEADLINE_EXCEEDED error. The error persists on retries until the server or client is restarted.
Reproduction steps
Unknown - appears to eventually enter this state in longer lived environments.
Environment
Additional context
Client Logs
Server Logs
As you can see, the server responds well within the deadline but the client never gets the response. I know transient failures can happen, but since the error persists on retries, it appears there is something deeper going on here.