grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Calls trigger ERR_HTTP2_SESSION_ERROR with node 20.10.0 #2625

Open AlbertLlebaria opened 7 months ago

AlbertLlebaria commented 7 months ago

Problem description

When running multiple gRPC requests asynchronously with the same gRPC client, the client at some point starts cancelling those calls with the following error message: Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2

I've enabled the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG; these are the logs for one of the calls that received the RST_STREAM error.

D 2023-11-28T11:52:30.785Z | subchannel_call | [5567] write() called with message of length 433
D 2023-11-28T11:52:30.785Z | subchannel_call | [5567] sending data chunk of length 433
D 2023-11-28T11:52:31.297Z | subchannel_call | [5567] end() called
D 2023-11-28T11:52:31.297Z | subchannel_call | [5567] calling end() on HTTP/2 stream
D 2023-11-28T11:52:35.382Z | subchannel_call | [5567] Node error event: message=Session closed with error code 2 code=ERR_HTTP2_SESSION_ERROR errno=Unknown system error undefined syscall=undefined
D 2023-11-28T11:52:35.393Z | subchannel_call | [5567] HTTP/2 stream closed with code 2
D 2023-11-28T11:52:35.393Z | subchannel_call | [5567] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"

Right before the calls start triggering the ERR_HTTP2_SESSION_ERROR, I found a log entry explaining that the connection was closed by a GOAWAY with code 2: 0.0.0.0:3000 connection closed by GOAWAY with code 2.

D 2023-11-28T11:52:35.354Z | resolving_call | [6070] Received status
D 2023-11-28T11:52:35.354Z | resolving_call | [6070] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | subchannel_call | [6202] HTTP/2 stream closed with code 0
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] Received server trailers:
        grpc-status: 0
        grpc-message: OK

D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] received status code 0 from server
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] received status details string "OK" from server
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] close http2 stream with code 0
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | load_balancing_call | [6338] Received status
D 2023-11-28T11:52:35.354Z | load_balancing_call | [6338] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] Received status from child [6338]
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] state=COMMITTED handling status with progress PROCESSED from child [6338] in state ACTIVE
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | resolving_call | [6247] Received status
D 2023-11-28T11:52:35.354Z | resolving_call | [6247] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] HTTP/2 stream closed with code 0
D 2023-11-28T11:52:35.355Z | transport | (12) 0.0.0.0:3000 connection closed by GOAWAY with code 2
D 2023-11-28T11:52:35.355Z | subchannel | (11) 0.0.0.0:3000 READY -> IDLE
D 2023-11-28T11:52:35.355Z | subchannel_refcount | (11) 0.0.0.0:3000 refcount 2 -> 1
D 2023-11-28T11:52:35.355Z | pick_first | READY -> IDLE
D 2023-11-28T11:52:35.355Z | resolving_load_balancer | dns:0.0.0.0:3000 READY -> IDLE
D 2023-11-28T11:52:35.355Z | connectivity_state | (1) dns:0.0.0.0:3000 READY -> IDLE

You can find the full logs in the attached file: grpcLogs.txt

This error started happening after upgrading our Node.js project to version 20.10.0. The problem is only reproducible when a "large" number of gRPC calls is triggered (2361 calls in the attached logs); with a smaller number of calls I couldn't reproduce the error.

Moreover, the ERR_HTTP2_SESSION_ERROR is not reproducible with older Node versions (the issue didn't appear when running on Node 20.5.1).

Reproduction steps

Use Node.js 20.10.0 and install @grpc/grpc-js 1.9.12. Run multiple gRPC calls asynchronously with the same client, as in the sketch below. The issue is also reproducible with older versions of @grpc/grpc-js, such as 1.8.14.
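
For illustration, a minimal sketch of the kind of load that triggers it (echo.proto, the echo package, and EchoService.Echo are hypothetical placeholders, not our actual service):

```js
// repro-sketch.js -- fire a few thousand unary calls concurrently on one client.
// Assumes a hypothetical echo.proto with package `echo` and a unary EchoService.Echo method.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const packageDef = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDef).echo;
const client = new proto.EchoService('0.0.0.0:3000', grpc.credentials.createInsecure());

function echoOnce(i) {
  return new Promise((resolve, reject) => {
    client.Echo({ message: `call-${i}` }, (err, res) => (err ? reject(err) : resolve(res)));
  });
}

async function main() {
  // A couple of thousand concurrent calls reproduces it for us; small batches do not.
  const results = await Promise.allSettled(Array.from({ length: 2500 }, (_, i) => echoOnce(i)));
  const failed = results.filter((r) => r.status === 'rejected');
  console.log(`failed: ${failed.length}/${results.length}`);
  failed.slice(0, 3).forEach((r) => console.error(r.reason.message));
}

main();
```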

Environment

murgatroid99 commented 7 months ago

If your code works in one version of Node and doesn't work in a later version of Node, my first assumption is that this is a regression in Node itself. Since the log says that the session ended with a GOAWAY with error code 2 (internal error), that implies that it received that error from the server, though it is possible that the runtime synthesized that GOAWAY on the client. If it did come from the server, then maybe the client behaves differently in Node 20 than in earlier versions of Node, in a way that causes the server error.

valentingc commented 6 months ago

I am facing exactly the same error in my application, which involves a high degree of concurrency. In my case, I see exactly the same error/log output with Node 18.19.0 and Node 20.10.0. It works if I pin the Node version at 18.16.0. I did not try the versions between 18.16.0 and 18.19.0, but I suppose the root cause must be a code change somewhere between those versions.

I have investigated the traffic with Wireshark, and the server sends an HTTP/2 GOAWAY with code 2. The @grpc/grpc-js client then throws the mentioned exception and, in my case, does not retry the request. I have tried @grpc/grpc-js versions 1.7.3 and 1.9.13; the behavior is the same. The behavior is also the same on Windows 11, Ubuntu 20.04, and Ubuntu 22.04. It does not matter whether the client manages the channel on its own or whether I force all concurrent requests onto the same channel.

My logs are identical to those of @AlbertLlebaria. Please let me know if there is anything else I can provide to further investigate the issue. Thanks a lot! :)

sagarjhaa commented 6 months ago

I am also facing the same issue in my application, mostly during peak usage. I am getting the same exception, which causes context cancellation for other upstream services. I am on Node version 20.10.0 and using grpc-js version 1.9.13.

Please let me know if I can provide any more context or help in investigating the issue. Thank you.

murgatroid99 commented 6 months ago

Practically speaking, I think the only real impact of this bug is a slightly confusing error description. The logs show that the client received a GOAWAY with code 2. When Node receives that, it also ends any open streams on that connection with code 2. gRPC translates the HTTP2 code 2 into the gRPC status code 13, which is what we are seeing here. The confusing part is that this should be reported as an internal server error, but the gRPC client is interpreting it as an internal client error.

AVVS commented 4 months ago

I can recreate this error fairly easily. The cause is that, under load, when many streams close at the same time (especially via RST_STREAM), the session socket receives UV_EOF, and that session is then terminated with a GOAWAY sent by the Node.js server. If you are interested, there is a little thread here: https://twitter.com/v_aminev/status/1757476612671492390

Typically I can recreate this on my local machine with ~25000 sessions closing at the same time. My gut tells me it has something to do with ACKs not happening on time, or with some sort of timeout/bug in libuv that makes it raise UV_EOF (technically there should be a FIN packet received by the stream, but that is not happening here).

murgatroid99 commented 4 months ago

@AVVS A GOAWAY with a non-zero error code causing open streams to close is normal behavior, at least at the protocol level. We don't need to look for any lower-level failures to explain why this is happening.

AVVS commented 4 months ago

That's not what I mean here. The scenario is the following:

  1. Session opens
  2. 75000 (example) server streams are opened with a deadline of 30s (example)
  3. These streams are closed by client and 75000 RST_STREAM with error code 0 are sent to server
  4. Server starts terminating streams
  5. At some point the session receives UV_EOF on read (there would have to be a FIN packet, but none is sent)
  6. Server sends goaway, finishes graceful termination of streams, closes session
  7. Gets fin ack
  8. Sends fin ack

The error is the UV_EOF at step 5: the stream is perfectly fine, and no GOAWAY has been received or sent yet. A rough sketch of steps 1-3 from the client side follows.
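
Something like this on the client side (a sketch only; events.proto, EventService, and Subscribe are hypothetical placeholders, and the real load test goes through Envoy to the gRPC server):

```js
// Open many server-streaming calls with the same short deadline, so they all
// get torn down at roughly the same time.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const proto = grpc.loadPackageDefinition(protoLoader.loadSync('events.proto')).events;
const client = new proto.EventService('0.0.0.0:3000', grpc.credentials.createInsecure());

const deadline = new Date(Date.now() + 30_000); // step 2: 30s deadline

for (let i = 0; i < 75_000; i++) {
  const stream = client.Subscribe({ topic: `t-${i}` }, { deadline });
  stream.on('data', () => { /* drain */ });
  // When the deadline expires (or cancel() is called), the client tears the stream
  // down, which is what produces the burst of RST_STREAM frames in step 3.
  stream.on('error', () => { /* DEADLINE_EXCEEDED / CANCELLED is expected here */ });
}
```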

AVVS commented 4 months ago

https://twitter.com/v_aminev/status/1759772087512686736 - a little more context, the first link might not have opened the full thread.

0.24 is the gRPC server IP; 0.9 is Envoy, proxying traffic over to the gRPC server.

many RST_STREAM -> 0.24 (due to the grpc-timeout header on the server streams, for load-testing purposes) -- sudden UV_EOF in ::OnStreamRead, handled by the http2 server by issuing GOAWAY[0] -- RST_STREAM[8] sent out for still-open streams

To sum up: I don't think it's a bug in grpc-js, but I do think there is something wrong with libuv's handling of the open socket, or maybe I am simply not aware of some setting I could change. This doesn't happen if the Node process is not overloaded, so maybe there is a timeout or something similar where a stream read is scheduled, the callback is not invoked for one reason or another, and it raises UV_EOF.

(Three Wireshark screenshots attached: 2024-02-19 at 18:44:29, 17:03:32, and 17:03:28.)

murgatroid99 commented 4 months ago

And you're saying that this UV_EOF event causes Node to report that it has received a GOAWAY with error code 2, even though it hasn't?

AVVS commented 4 months ago

It's unlikely that this is directly related (because of the GOAWAY code 2), but I thought it might be somewhat relevant since "peak usage" was mentioned later; it's probably related to that rather than to the original issue.

murgatroid99 commented 4 months ago

If the things you're describing aren't related to the original issue, and you don't think they're a bug in grpc-js, why are you posting it here?

AVVS commented 4 months ago

@murgatroid99 On the off chance that it is related. Even if it's not, I think it might be helpful for others debugging this: specifically, they could set NODE_DEBUG_NATIVE=HTTP2SESSION,HTTP2STREAM and NODE_DEBUG=http,net in the environment to get more logs, and/or gather a bit more data with Wireshark or similar inspection tools, so that the reason for the GOAWAY[2] becomes clear.

EDIT: if you think what's been posted is not helpful at all, I'll just clean the messages up, sorry.

bbit8 commented 3 months ago

@murgatroid99 Should handleTrailers call halfClose() here instead of endCall()? endCall will send RST_STREAM, not END_STREAM.

murgatroid99 commented 3 months ago

@bbit8 No. When the client receives trailers on a stream, that indicates that the stream has finished. endCall propagates the status up the stack, and cleans up the stream. It might send an RST_STREAM, but that shouldn't have any effect.

bbit8 commented 3 months ago

> @bbit8 No. When the client receives trailers on a stream, that indicates that the stream has finished. endCall propagates the status up the stack, and cleans up the stream. It might send an RST_STREAM, but that shouldn't have any effect.

@murgatroid99 Under high load, the RST_STREAMs will trigger this silly rate limiter in Node.js. Is that possible? rate limiter here -> rate limiter

murgatroid99 commented 3 months ago

Maybe it could, but it shouldn't. The rate limit was added to prevent the HTTP/2 rapid reset attack, which involves a client creating streams and then immediately closing them, many times.

Besides, that code isn't sending the RST_STREAM explicitly. It's just calling destroy to clean up whatever resources are associated with the stream, and Node internally decides that sending an RST_STREAM is the right thing to do there.

bbit8 commented 3 months ago

@murgatroid99 endCall() -> destroyHttp2Stream() -> http2Stream.close(). http2Stream.close() will send RST_STREAM; this is in the Node http2 documentation, here.

The rate limiter will send a GOAWAY when it receives an excessive number of RST_STREAM frames, no matter who sends them.

So under high load, the grpc-js client receives an excessive number of responses, which produces an excessive number of RST_STREAM frames, which then triggers the rate limiter in Node.

murgatroid99 commented 3 months ago

I honestly forgot I had that in there.

destroyHttp2Stream does need to call http2Stream.close sometimes, to deal with cancellations. The condition should probably be that it makes that call if the client has not received trailers or a close event from the server.
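
Roughly the shape of that condition (a sketch with hypothetical flag names, not the actual grpc-js source):

```js
const http2 = require('node:http2');

// Sketch only: `call.receivedTrailers` and `call.receivedCloseEvent` stand in
// for whatever state the real call object keeps about the server's trailers
// and 'close' event.
function destroyHttp2Stream(call) {
  if (!call.receivedTrailers && !call.receivedCloseEvent) {
    // Only an actual cancellation needs an explicit reset; this is what makes
    // Node send an RST_STREAM frame on the wire.
    call.http2Stream.close(http2.constants.NGHTTP2_CANCEL);
  }
  // If trailers or a close event were already seen, the stream has already ended
  // from the server's point of view, so skipping close() avoids the redundant
  // RST_STREAM that feeds Node's rapid-reset rate limiter.
}
```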

StefanPrintezis commented 3 months ago

I started noticing this error occasionally on bursts of sequential workloads, 1000 individual calls per second from one client to the same server.

Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2

It started occurring after we updated Envoy on both server and client from v1.27.2 to v1.27.3. I cannot correlate anything from the diff or changelog back to this; I haven't looked into the dependency upgrades yet.

Using a rate limiter was a short-term solution (a sketch of what we mean is below), but I'm hoping the redundant-RST_STREAM PR #2695 will prevent Envoy from sending GOAWAY to the client under the bursty sequential workload.
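
For anyone needing the same stopgap, this is roughly what we mean by a rate limiter: a small in-process concurrency cap (sketch only, numbers illustrative; it does not address the underlying GOAWAY behavior):

```js
// Cap how many gRPC calls are in flight at once.
function makeLimiter(limit) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= limit || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    fn().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}

// Usage: `callUnary` is a hypothetical promisified client call.
const limited = makeLimiter(100);
// await limited(() => callUnary(request));
```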

murgatroid99 commented 3 months ago

I just published the change in #2695 in grpc-js version 1.10.4. Please try it out.

BrodaNoel commented 3 months ago

@murgatroid99 in order to get this update in Firebase's dependencies, I ran npm update, which updated to 1.10.4.

If the error persists, I'll be back.

BrodaNoel commented 3 months ago

We are going to have a problem with Node.js 18. It uses 1.8.21, because of firebase-admin@11:

➜  functions git:(master) ✗ npm why @grpc/grpc-js
@grpc/grpc-js@1.8.21
node_modules/@grpc/grpc-js
  @grpc/grpc-js@"~1.8.0" from google-gax@3.6.1
  node_modules/google-gax
    google-gax@"^3.5.7" from @google-cloud/firestore@6.8.0
    node_modules/@google-cloud/firestore
      optional @google-cloud/firestore@"^6.8.0" from firebase-admin@11.11.1

BrodaNoel commented 3 months ago

@murgatroid99 it seems like the error is still there. I'm getting this after upgrading to 1.10.4:

 Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
    at callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:192:76)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /workspace/node_modules/@grpc/grpc-js/build/src/resolving-call.js:99:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11) 

robrichard commented 1 month ago

We are also seeing this error, 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error), after updating from Node.js 18.15.0 to 20.11.1, using grpc-js@1.10.5. It seems like there is still an issue related to the rapid reset rate limiter. Our affected codebase only uses grpc-js as a client; it does not host a gRPC server. We see a steady rate of these errors while performing a large number of gRPC client calls.

alexeevg commented 2 weeks ago

We're seeing a similar issue; it happens on both Node 18 and Node 20 using grpc-js@1.10.9.

In our setup, the Envoy proxy is configured with max_requests_per_connection = 10000. After every 10000 unary calls on a TCP connection, it initiates a graceful shutdown with a GOAWAY with no error code and last stream id = 2^31 - 1, as specified in the HTTP/2 RFC.

The problem is that grpc-js (or maybe even the Node.js http2 module) doesn't handle graceful shutdown via GOAWAY very well. Some in-flight requests fail with Received RST_STREAM with code 2 (Internal server error) when the connection is closed. We checked the traffic dump and didn't see any RST_STREAMs sent from the server side; the errors seem to be caused by the way the Node.js client handles the GOAWAY.

Here's the fragment of the client log with GRPC_TRACE=all and NODE_DEBUG=http2:

HTTP2 69: Http2Session client: initiating request
HTTP2 69: Http2Session client: connected, initializing request
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | transport_flowctrl | (6) 10.43.160.91:81 local window size: 57965 remote window size: 268385230
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | transport_internals | (6) 10.43.160.91:81 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | load_balancing_call | [80222] Created child call [80223]
D 2024-06-18T15:52:19.735Z | v1.10.9 69 | retrying_call | [80221] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | load_balancing_call | [80222] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_call | [80223] write() called with message of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_call | [80223] sending data chunk of length 5
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | retrying_call | [80221] halfClose called
HTTP2 69: Http2Session client: goaway 0 received [last stream id: 2147483647]
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | transport | (6) 10.43.160.91:81 connection closed by GOAWAY with code 0 and data undefined
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel | (3) 10.43.160.91:81 READY -> IDLE
HTTP2 69: Http2Session client: marking session closed
HTTP2 69: Http2Session client: submitting goaway
D 2024-06-18T15:52:19.736Z | v1.10.9 69 | subchannel_refcount | (3) 10.43.160.91:81 refcount 2 -> 1
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | pick_first | READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | resolving_load_balancer | dns:10.43.160.91:81 READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | connectivity_state | (1) dns:10.43.160.91:81 READY -> IDLE
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | resolving_load_balancer | dns:10.43.160.91:81 IDLE -> CONNECTING
D 2024-06-18T15:52:19.737Z | v1.10.9 69 | connectivity_state | (1) dns:10.43.160.91:81 IDLE -> CONNECTING
HTTP2 69: Http2Session client: error sending frame type 1 on stream 20001, code: 2
HTTP2 69: Http2Stream 20001 [Http2Session client]: closed with code 7, closed true, readable true
HTTP2 69: Http2Stream 20001 [Http2Session client]: destroying stream
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | subchannel_call | [80223] HTTP/2 stream closed with code 2
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | subchannel_call | [80223] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"
D 2024-06-18T15:52:19.739Z | v1.10.9 69 | load_balancing_call | [80222] Received status
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | load_balancing_call | [80222] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)" start time=2024-06-18T15:52:19.735Z
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] Received status from child [80222]
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [80222] in state ACTIVE
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | retrying_call | [80221] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)" start time=2024-06-18T15:52:19.735Z
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | resolving_call | [80220] Received status
D 2024-06-18T15:52:19.740Z | v1.10.9 69 | resolving_call | [80220] ended with status: code=13 details="Received RST_STREAM with code 2 (Internal server error)"

I wonder if there's any workaround or configuration to fix this. We're willing to retry the failed requests, but the internal server error is a bad error code to retry.
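
For what it's worth, this is the kind of retry opt-in we're weighing as a stopgap (a sketch only: it assumes the affected methods are idempotent, since blindly retrying INTERNAL is usually a bad idea, and it uses grpc-js's service-config retryPolicy support, here passed through the grpc.service_config channel option; echo.proto and echo.EchoService are hypothetical placeholders):

```js
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Treat INTERNAL (and UNAVAILABLE) as retryable for one specific, idempotent service.
const serviceConfig = {
  methodConfig: [
    {
      name: [{ service: 'echo.EchoService' }], // hypothetical service name
      retryPolicy: {
        maxAttempts: 3,
        initialBackoff: '0.1s',
        maxBackoff: '1s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE', 'INTERNAL'],
      },
    },
  ],
};

const proto = grpc.loadPackageDefinition(protoLoader.loadSync('echo.proto')).echo;
const client = new proto.EchoService(
  'dns:///my-grpc-server:443', // placeholder address
  grpc.credentials.createSsl(),
  { 'grpc.service_config': JSON.stringify(serviceConfig) }
);
```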

murgatroid99 commented 2 weeks ago

I think the key line is "error sending frame type 1 on stream 20001, code: 2"

Here's what I think is happening in that log: the client trying to start the 10,001st stream races with the incoming GOAWAY, and when the client tries to send the headers for that stream, it causes this error, which Node reports as an INTERNAL error (indistinguishable from an internal server error). I don't think there's much gRPC can do about this: the best fix would be for the http2 module to report a different error code in this situation.

alexeevg commented 1 day ago

@murgatroid99 thanks for your response!

I reproduced the issue using pure grpc-js; please see this commit in my fork.

Basically, we need 'grpc.max_connection_age_ms': 500 on the server to trigger frequent GOAWAY frames, plus enough concurrent requests from the client to hit the race condition, which results in RST_STREAM with code 2 and CANCELLED errors (a rough sketch of the setup follows the log below):

D 2024-07-06T14:25:38.602Z | v1.10.10 98559 | transport | (37) 127.0.0.1:50052 connection closed by GOAWAY with code 0 and data max_age
D 2024-07-06T14:25:38.602Z | v1.10.10 98559 | transport | dns:localhost:50052 creating HTTP/2 session to 127.0.0.1:50052
13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
1 CANCELLED: Call cancelled
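
Roughly what the reproduction boils down to (a sketch only, not the code from the linked commit; echo.proto and EchoService are hypothetical stand-ins):

```js
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const proto = grpc.loadPackageDefinition(protoLoader.loadSync('echo.proto')).echo;

// Server: cap connection age so it sends a graceful GOAWAY roughly every 500ms.
const server = new grpc.Server({ 'grpc.max_connection_age_ms': 500 });
server.addService(proto.EchoService.service, {
  Echo: (call, callback) => callback(null, { message: call.request.message }),
});
server.bindAsync('0.0.0.0:50052', grpc.ServerCredentials.createInsecure(), () => {
  // (in grpc-js 1.10+ the server starts serving once bindAsync succeeds)
  // Client: keep enough unary calls in flight that some of them race the GOAWAY.
  const client = new proto.EchoService('localhost:50052', grpc.credentials.createInsecure());
  for (let i = 0; i < 2000; i++) {
    client.Echo({ message: `ping-${i}` }, (err) => {
      // Occasionally prints "13 INTERNAL: Received RST_STREAM with code 2 ..." or "1 CANCELLED ..."
      if (err) console.error(err.message);
    });
  }
});
```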

Running the client with http2 debug logs produces the same errors as you mentioned:

NODE_DEBUG=http2 GRPC_TRACE=transport GRPC_VERBOSITY=DEBUG node client.js 2>&1 | grep 'error sending'
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1889, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1891, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1893, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1895, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1897, code: 2
HTTP2 99942: Http2Session client: error sending frame type 1 on stream 1899, code: 2

I find the discussions you had with Node.js maintainers here and here particularly relevant. Do you think it makes sense to revive the discussion with this code example? Maybe I can spend some time and turn it into a pure node:http2 issue demo, although I'm having a hard time understanding what exactly the contract of the http2 module is regarding GOAWAY handling.