AlbertLlebaria opened 7 months ago
If your code works in one version of Node and doesn't work in a later version of Node, my first assumption is that this is a regression in Node itself. Since the log says that the session ended with a GOAWAY with error code 2 (internal error), that implies that it received that error from the server, though it is possible that the runtime synthesized that GOAWAY on the client. If it did come from the server, then maybe the client behaves differently in Node 20 than in earlier versions of Node, in a way that causes the server error.
I am facing exactly the same error in my application, which involves a high degree of concurrency. In my case, I see exactly the same error/log output with Node versions 18.19.0 and 20.10.0. It works if I pin the Node version at 18.16.0. I did not try the versions between 18.16.0 and 18.19.0, but I suppose the root cause of the problem must be a code change somewhere between those versions.
I have investigated the traffic with Wireshark, and the server sends an HTTP/2 GOAWAY with code 2. The @grpc/grpc-js client then throws the mentioned exception and, in my case, does not retry the request. I have tried @grpc/grpc-js versions 1.7.3 and 1.9.13; the behavior is the same. The behavior is also the same on Windows 11, Ubuntu 20.04, and Ubuntu 22.04. It does not matter whether the client manages the channel on its own or whether I override the channel so that all concurrent requests share the same one.
My logs are identical to @AlbertLlebaria's. Please let me know if there is anything else that I can provide to further investigate the issue. Thanks a lot! :)
I am also facing the same issue in my application, mostly during peak usage. I am getting the same exception, which is causing context cancellation for other upstream services. I am on Node version 20.10.0 and using grpc-js version 1.9.13.
Please let me know if I can provide any more context or help in investigating the issue. Thank you.
Practically speaking, I think the only real impact of this bug is a slightly confusing error description. The logs show that the client received a GOAWAY with code 2. When Node receives that, it also ends any open streams on that connection with code 2. gRPC translates the HTTP2 code 2 into the gRPC status code 13, which is what we are seeing here. The confusing part is that this should be reported as an internal server error, but the gRPC client is interpreting it as an internal client error.
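For reference, here is a hedged sketch of that translation. It is a simplified version of the HTTP/2-error-code-to-gRPC-status mapping in the gRPC-over-HTTP/2 spec, not grpc-js's actual code, and it shows why an HTTP/2 INTERNAL_ERROR (2) on a stream surfaces as gRPC status 13 (INTERNAL):

```javascript
// Simplified sketch of the HTTP/2 -> gRPC status mapping (see the
// gRPC-over-HTTP/2 spec for the full table; this is NOT grpc-js code).
const GrpcStatus = {
  CANCELLED: 1,
  RESOURCE_EXHAUSTED: 8,
  INTERNAL: 13,
  UNAVAILABLE: 14,
};

function http2ErrorToGrpcStatus(http2Code) {
  switch (http2Code) {
    case 7: // REFUSED_STREAM: the stream never ran, safe to retry
      return GrpcStatus.UNAVAILABLE;
    case 8: // CANCEL
      return GrpcStatus.CANCELLED;
    case 11: // ENHANCE_YOUR_CALM (e.g. rapid-reset rate limiting)
      return GrpcStatus.RESOURCE_EXHAUSTED;
    case 2: // INTERNAL_ERROR: the case discussed in this issue
    default:
      return GrpcStatus.INTERNAL;
  }
}

console.log(http2ErrorToGrpcStatus(2)); // 13 (INTERNAL)
```

As the comment above notes, the status code alone cannot tell you whether the internal error came from the server or was synthesized by the client runtime.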
I can recreate this error fairly easily. The cause is that under load, when many streams close at the same time (especially via RST_STREAM), the session socket receives UV_EOF, and that session is then terminated with a GOAWAY sent from the Node.js server. If you are interested, there is a little thread here: https://twitter.com/v_aminev/status/1757476612671492390
Typically I can recreate this on my local machine with ~25000 sessions closing at the same time. My gut tells me it has something to do with ACKs not happening on time, or some sort of timeout/bug in libuv where it raises UV_EOF (technically that should mean a FIN packet was received by the stream, but that is not happening here).
@AVVS A GOAWAY with a non-zero error code causing open streams to close is normal behavior, at least at the protocol level. We don't need to look for any lower-level failures to explain why this is happening.
That's not what I mean here. The scenario is the following:
The error is UV_EOF at step 5 - the stream is perfectly fine; no GOAWAY was received or sent yet.
https://twitter.com/v_aminev/status/1759772087512686736 - a little more context; the first link might not have opened the full thread.
0.24 is the gRPC server IP; 0.9 is Envoy proxying traffic over to the gRPC server.
Many RST_STREAM -> 0.24 (due to the grpc-timeout header for a server stream, load testing purposes) -- sudden UV_EOF in ::OnStreamRead, handled by the http2 server by issuing GOAWAY[0] -- RST_STREAM[8] frames for still-open streams are sent out.
To sum up - I don't think it's a bug in grpc-js, but I do think there is something wrong with libuv's handling of the open socket, or maybe I am simply not aware of some setting that I could change. This doesn't happen if the Node process is not overloaded, so maybe there is a timeout or something like that where a stream read is scheduled, the callback is not invoked for one reason or another, and it raises UV_EOF.
And you're saying that this UV_EOF event causes Node to report that it has received a GOAWAY with error code 2, even though it hasn't?
It's unlikely that this is directly related (because of the GOAWAY code 2), but I thought it might be somewhat related since "peak usage" was mentioned later - that report is probably related to this, rather than to the original issue.
If the things you're describing aren't related to the original issue, and you don't think they're a bug in grpc-js, why are you posting it here?
@murgatroid99 On the off chance that it is related. Even if it's not, I think it might be helpful in debugging for others: specifically, they could turn on NODE_DEBUG_NATIVE=HTTP2SESSION,HTTP2STREAM and NODE_DEBUG=http,net in the environment, which will give more logs, and/or gather a bit more with Wireshark or similar inspection tools so that the reason for the GOAWAY[2] becomes clear.
EDIT: if you think what's been posted is not helpful at all, I'll just clean the messages up, sorry.
@murgatroid99 Should handleTrailers call halfClose() here instead of endCall()? endCall will send RST_STREAM, not END_STREAM.
@bbit8 No. When the client receives trailers on a stream, that indicates that the stream has finished. endCall propagates the status up the stack and cleans up the stream. It might send an RST_STREAM, but that shouldn't have any effect.
@murgatroid99 When under high load, RST_STREAM will trigger this silly rate limiter in Node.js. Is that possible? Rate limiter here.
Maybe it could, but it shouldn't. The rate limit was added to prevent the HTTP/2 rapid reset attack, which involves a client creating streams and then immediately closing them, many times.
Besides, that code isn't sending the RST_STREAM explicitly. It's just calling destroy to clean up whatever resources are associated with the stream, and Node internally decides that sending an RST_STREAM is the right thing to do there.
@murgatroid99 endCall() -> destroyHttp2Stream() -> http2Stream.close(). http2Stream.close() will send RST_STREAM; this is in the Node http2 documentation.
The rate limiter will send GOAWAY when it receives an excessive number of RST_STREAM frames, no matter who sends them.
So under high load, the grpc-js client receives an excessive number of responses, which leads it to send an excessive number of RST_STREAM frames, which then triggers the rate limiter in Node.
I honestly forgot I had that in there. destroyHttp2Stream does need to call http2Stream.close sometimes, to deal with cancellations. The condition should probably be that it makes that call if the client has not received trailers or a close event from the server.
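That condition could look something like the following sketch. The state flags here are hypothetical bookkeeping (set by the 'trailers' and 'close' event handlers), not grpc-js's actual fields:

```javascript
// Illustrative sketch of the condition described above; the `state`
// flags are hypothetical bookkeeping, not grpc-js's real fields.
function destroyHttp2Stream(http2Stream, state) {
  if (!state.receivedTrailers && !state.receivedCloseEvent) {
    // The call was cancelled while still in flight: an explicit
    // RST_STREAM (CANCEL) is needed so the server stops working on it.
    http2Stream.close(8 /* NGHTTP2_CANCEL */);
  }
  // If the server already finished the stream (trailers or 'close'
  // received), skip close() so that no redundant RST_STREAM is emitted
  // that could trip the server's rapid-reset rate limiter.
}
```

The point of the condition is to keep cancellation working while avoiding the redundant resets discussed above.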
I started noticing this error occasionally on bursts of sequential workloads, 1000 individual calls per second from one client to the same server.
Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2
It started occurring after we updated Envoy on both server and client from v1.27.2 to v1.27.3. I cannot correlate anything in the diff or changelog to this; I haven't looked at the dependency upgrades yet.
Using a rate limiter was a short-term solution, but I'm hoping the redundant RST_STREAM fix in PR #2695 will prevent Envoy from sending GOAWAY to the client under the burst sequential workload.
I just published the change in #2695 in grpc-js version 1.10.4. Please try it out.
@murgatroid99 In order to get this update into Firebase's dependencies, I ran npm update, which updated to 1.10.4.
If the error persists, I'll be back.
We're going to have a problem with Node.js 18. It uses 1.8.21, because of firebase-admin@11:
➜ functions git:(master) ✗ npm why @grpc/grpc-js
@grpc/grpc-js@1.8.21
node_modules/@grpc/grpc-js
@grpc/grpc-js@"~1.8.0" from google-gax@3.6.1
node_modules/google-gax
google-gax@"^3.5.7" from @google-cloud/firestore@6.8.0
node_modules/@google-cloud/firestore
optional @google-cloud/firestore@"^6.8.0" from firebase-admin@11.11.1
@murgatroid99 It seems like the error is still there. I'm getting this after upgrading to 1.10.4:
Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
at callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/resolving-call.js:99:78
at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
We are also seeing the error 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) after updating from Node.js 18.15.0 to 20.11.1, using grpc-js@1.10.5. It seems like there is still an issue related to the rapid reset rate limiter. Our affected codebase only uses grpc-js as a client; it does not host a gRPC server. We see a steady rate of these errors while performing a large number of gRPC client calls.
We're seeing a similar issue; it happens on both Node 18 and 20 using grpc-js@1.10.9.
In our setup, the Envoy proxy is configured with max_requests_per_connection = 10000. After every 10000 unary calls on a TCP connection, it initiates a graceful shutdown with a GOAWAY with no error code and last stream id = 2^31 - 1, as specified in the HTTP/2 RFC.
The problem is that grpc-js (or maybe even the node.js http2 module) doesn't handle graceful shutdown via GOAWAY very well. Some in-flight requests fail with Received RST_STREAM with code 2 (Internal server error) when the connection is closed. We checked the traffic dump and didn't see any RST_STREAM frames sent from the server side; the errors seem to be caused by the way the node.js client handles the GOAWAY.
Here's the fragment of the client log with GRPC_TRACE=all and NODE_DEBUG=http2:
HTTP2 69: Http2Session client: initiating request
HTTP2 69: Http2Session client: connected, initializing request
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | transport_flowctrl | (6) 10.43.160.91:81 local window size: 57965 remote window size: 268385230
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | transport_internals | (6) 10.43.160.91:81 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | load_balancing_call | [80222] Created child call [80223]
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | retrying_call | [80221] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | load_balancing_call | [80222] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_call | [80223] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_call | [80223] sending data chunk of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | retrying_call | [80221] halfClose called
HTTP2 69: Http2Session client: goaway 0 received [last stream id: 2147483647]
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | transport | (6) 10.43.160.91:81 connection closed by GOAWAY with code 0 and data undefined
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel | (3) 10.43.160.91:81 READY -> IDLE
HTTP2 69: Http2Session client: marking session closed
HTTP2 69: Http2Session client: submitting goaway
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_refcount | (3) 10.43.160.91:81 refcount 2 -> 1
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | pick_first | READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | resolving_load_balancer | dns:10.43.160.91:81 READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | connectivity_state | (1) dns:10.43.160.91:81 READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | resolving_load_balancer | dns:10.43.160.91:81 IDLE -> CONNECTING
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | connectivity_state | (1) dns:10.43.160.91:81 IDLE -> CONNECTING
HTTP2 69: Http2Session client: error sending frame type 1 on stream 20001, code: 2
HTTP2 69: Http2Stream 20001 [Http2Session client]: closed with code 7, closed true, readable true
HTTP2 69: Http2Stream 20001 [Http2Session client]: destroying stream
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | subchannel_call | [80223] HTTP/2 stream closed with code 2
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | subchannel_call | [80223] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | load_balancing_call | [80222] Received status
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | load_balancing_call | [80222] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)" start time=2024-06-18T15:52:19.735Z
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] Received status from child [80222]
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [80222] in state ACTIVE
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)" start time=2024-06-18T15:52:19.735Z
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | resolving_call | [80220] Received status
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | resolving_call | [80220] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
I wonder if there's any workaround or configuration to fix this. We're willing to retry the failed requests, but internal server error is a bad error code to retry on.
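One possible workaround, sketched here in hedged form: retry these calls manually at the application level, on the theory (from the diagnosis in this thread) that these particular INTERNAL errors are transient races with GOAWAY rather than real server failures. callFn is any function returning a Promise for a single call attempt; the names are illustrative:

```javascript
// Hedged workaround sketch: retry unary calls that fail with status 13
// (INTERNAL) or 14 (UNAVAILABLE). `callFn` is a hypothetical function
// that performs one call attempt and returns a Promise.
const RETRYABLE_CODES = new Set([13 /* INTERNAL */, 14 /* UNAVAILABLE */]);

async function callWithRetry(callFn, maxAttempts = 3) {
  let lastErr;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callFn();
    } catch (err) {
      lastErr = err;
      if (!RETRYABLE_CODES.has(err.code)) throw err;
      // Simple linear backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, 50 * attempt));
    }
  }
  throw lastErr;
}
```

Note that blanket-retrying INTERNAL is normally unsafe, since the call may have already executed on the server, so this only makes sense for idempotent requests.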
I think the key line is "error sending frame type 1 on stream 20001, code: 2"
Here's what I think is happening in that log: the client trying to start the 10,001st stream races with the incoming GOAWAY, and when the client tries to send the headers for that stream, it causes this error, which Node reports as an INTERNAL error (indistinguishable from an internal server error). I don't think there's much gRPC can do about this: the best fix would be for the http2 module to report a different error code in this situation.
@murgatroid99 thanks for your response!
I reproduced the issue using pure grpc-js; please see this commit in my fork.
Basically, we need 'grpc.max_connection_age_ms': 500 on the server to trigger many GOAWAY frames, and enough concurrent requests from the client to trigger the race condition resulting in RST_STREAM with code 2 and CANCELLED errors:
D 2024-07-06T14:25:38.602Z | v1.10.10 98559 | transport | (37) 127.0.0.1:50052 connection closed by GOAWAY with code 0 and data max_age
D 2024-07-06T14:25:38.602Z | v1.10.10 98559 | transport | dns:localhost:50052 creating HTTP/2 session to 127.0.0.1:50052
13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
1 CANCELLED: Call cancelled
Running the client with http2 debug logs produces the same errors as you mentioned:
NODE_DEBUG=http2 GRPC_TRACE=transport GRPC_VERBOSITY=DEBUG node client.js 2>&1 | grep 'error sending'
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1889, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1891, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1893, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1895, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1897, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1899, code: 2
I find the discussions you had with the Node.js maintainers here and here particularly relevant. Do you think it makes sense to revive the discussion with this code example? Maybe I can spend some time and turn it into a pure node:http2 issue demo, although I'm having a hard time understanding what exactly the contract of the http2 module is regarding GOAWAY handling.
Problem description
When running multiple gRPC requests asynchronously with the same gRPC client, the gRPC client at some point starts cancelling those calls with the following error message: Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2
I've enabled the following env variables, GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG, and these are the logs for one of the calls that received the RST_STREAM error. Right before the calls start triggering the ERR_HTTP2_SESSION_ERROR, I found a log explaining that the client got closed by GOAWAY with code 2: 0.0.0.0:3000 connection closed by GOAWAY with code 2. You can find the full logs in the attached file: grpcLogs.txt
This error happened after upgrading our Node.js project to version 20.10.0. The problem is only replicable when there is a "large" (2361 calls in the logs) number of gRPC calls triggered. When there is a reduced number of calls, I couldn't replicate the error.
Moreover, the ERR_HTTP2_SESSION_ERROR is not replicable with lower Node versions (the issue didn't appear when running on Node 20.5.1).
Reproduction steps
Use Node.js 20.10.0 and install @grpc/grpc-js 1.9.12. Run multiple gRPC calls with the same client asynchronously. The issue is also replicable with other, older versions of @grpc/grpc-js, such as 1.8.14.
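The reproduction pattern ("multiple calls with the same client asynchronously") can be sketched generically. This is a hedged illustration: makeCall is a hypothetical stand-in for a promisified unary gRPC call, and the concurrency cap reflects that the failure only shows up under load:

```javascript
// Hypothetical repro skeleton: fire `total` calls against one client
// with bounded concurrency. `makeCall` stands in for a promisified
// unary gRPC call; it is not a real grpc-js API.
async function burst(makeCall, total, concurrency) {
  const results = [];
  let next = 0;
  async function worker() {
    // `next` is read and incremented synchronously, so workers never
    // pick the same index even though they interleave at each await.
    while (next < total) {
      const i = next++;
      results[i] = await makeCall(i).catch((err) => err);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, total) }, worker)
  );
  return results;
}
```

Collecting errors instead of throwing makes it easy to count how many calls in a burst failed with the RST_STREAM error.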
Environment