grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

"New streams cannot be created after receiving a GOAWAY" and CANCELLED errors during service deployments #2694

Open · pratik151192 opened this issue 8 months ago

pratik151192 commented 8 months ago

Problem description

During our service deployments, the client fails a certain number of requests with CANCELLED errors. Our servers are configured to send a GOAWAY frame to clients during deployments to gracefully terminate connections. On May 11, 2023, we bumped our grpc-js version from 1.7.3 to 1.8.14. We reproduced our deployment scenario on multiple versions of our SDK, and that grpc-js version bump is where we start seeing the errors during deployments. We are not seeing this behavior with other SDKs.

I haven't bisected the grpc-js versions in between that range, and I see there are about 3k commits between the two versions, so it's hard to pinpoint a specific commit. I also see an open issue related to "New streams cannot be created after receiving a GOAWAY", but that covers only one part of the problem we're seeing. There's another ticket that seemed relevant. In our logs, we see two flavors of errors.

One is when the transport layer of gRPC detects a GOAWAY frame and logs it, but it is still followed by a few CANCELLED errors. The other is when the transport layer doesn't log the frame explicitly, but we see "new stream cannot be created after GOAWAY" messages, followed by more CANCELLED errors. Sometimes we see both of these behaviors.

Here's a log stream from a run where we see both of them, with HTTP and gRPC traces enabled:

with_cancelled_new_stream_errors.log

Below is a log stream from a gRPC version earlier than 1.8.14, where we don't see any CANCELLED or new-stream errors. Note that these runs do sometimes report INTERNAL errors (not as frequently as CANCELLED), but even those can be safely retried, so they aren't an issue. If there's a way for us to know whether the CANCELLED errors are safe to retry for non-idempotent operations, that would be awesome.

without_cancelled_new_stream_errors_old_grpc_version.log

Let me know if you need more logs! I have attached them as files to avoid cluttering this space. I can easily reproduce this in my setup, so I'll be happy to add any more traces/scenarios you'd like.
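
For reference, the traces above come from grpc-js's built-in tracing, controlled by the GRPC_VERBOSITY and GRPC_TRACE environment variables. A minimal sketch of enabling it, assuming the variables are read when the library is first loaded:

// Sketch: turn on grpc-js client tracing to capture transport/GOAWAY events.
// Assumption: GRPC_VERBOSITY and GRPC_TRACE are read when @grpc/grpc-js is
// first loaded, so they must be set before the module is required.
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'transport,subchannel,channel'; // or 'all'

// Use require() (not a hoisted import) so the lines above take effect first.
const grpc = require('@grpc/grpc-js');

// Trace lines such as
//   transport | (16) <host>:443 connection closed by GOAWAY with code 0
// should then appear on stderr.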

Reproduction steps

Environment

pratik151192 commented 8 months ago

@murgatroid99 Hey Michael, apologies in advance for tagging you explicitly, but I'm doing so because you had cut the ticket to Node as well for the "New streams" error.

murgatroid99 commented 8 months ago

You don't need to tag me. I get notified for every issue anyway.

The latest version is 1.10.3. Please verify that the issue still happens on that version and include updated trace logs.

pratik151192 commented 8 months ago

Thanks! Here are the logs:

A case on 1.10.3 where we see CANCELLED errors as well as "new streams cannot be created":

logs-1.10.3-case-1.log

Another case on 1.10.3 where we only see CANCELLED errors, without the "new streams" error:

logs-1.10.3-case-2.log

murgatroid99 commented 8 months ago

The first log shows the "new streams cannot be created" error getting transparently retried. That is exactly the intended behavior, and the error is not surfaced to the application, so that shouldn't be a problem.

As for the "cancelled" errors, the logs show that the connection received a GOAWAY with code 8 right before those errors were surfaced. That is also the expected behavior when the client receives a GOAWAY like that. You said in the first post on this issue:

Our servers are configured to send a GOAWAY frame to clients during deployments to gracefully terminate connections.

You don't give any details about this configuration, but it is common for servers gracefully closing connections to send a GOAWAY with code 0 to signal that no new streams should be opened, then wait some period of time (often referred to as a "grace period"), and then send a GOAWAY with code 8 to cancel all streams that are still open after that time. What we are seeing in this log is the expected outcome of that behavior.
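
As a rough illustration of that two-phase pattern (not necessarily what this particular server does), here is a sketch using the grpc-js server API; the grace period value is an arbitrary example:

import * as grpc from '@grpc/grpc-js';

// Graceful-then-forceful shutdown: stop accepting new work, give in-flight
// calls a grace period, then cancel whatever is still open.
function shutdownWithGracePeriod(server: grpc.Server, gracePeriodMs = 10_000): void {
  const timer = setTimeout(() => {
    // Grace period expired: cancel the streams that are still open.
    server.forceShutdown();
  }, gracePeriodMs);

  // tryShutdown refuses new calls and waits for existing ones to finish.
  server.tryShutdown(() => {
    clearTimeout(timer);
  });
}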

pratik151192 commented 8 months ago

Thanks! Are these CANCELLED errors safe to retry? I'm also curious what changed in grpc-js between 1.7.3 and 1.8.14, as I can consistently reproduce the errors surfacing since that version upgrade, and see no errors before it. The problem is that our clients use non-idempotent APIs we provide, such as increment, which we only retry on UNAVAILABLE and INTERNAL because those are safe. I'm not sure we can deem CANCELLED safe to retry. If not, I'm curious whether you have thoughts on handling these situations, because currently our clients experience these errors every time we have a service deployment.
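
For context, here is roughly the retry policy we apply today (a sketch: the helper name, attempt count, and backoff are illustrative rather than our exact production values):

import { status } from '@grpc/grpc-js';
import type { ServiceError } from '@grpc/grpc-js';

// Only statuses we consider safe for non-idempotent calls are retried;
// CANCELLED is deliberately not in this set.
const RETRYABLE_STATUSES = new Set<number>([status.UNAVAILABLE, status.INTERNAL]);

async function callWithRetry<T>(attempt: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let tries = 1; ; tries++) {
    try {
      return await attempt();
    } catch (err) {
      const code = (err as ServiceError).code;
      if (tries >= maxAttempts || !RETRYABLE_STATUSES.has(code)) {
        throw err; // CANCELLED (and anything else non-retryable) is surfaced
      }
      // Simple linear backoff, for illustration only.
      await new Promise((resolve) => setTimeout(resolve, 100 * tries));
    }
  }
}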

pratik151192 commented 8 months ago

This is from the framework that facilitates our server-side shutdown:

// According to http://httpwg.org/specs/rfc7540.html#GOAWAY:
//
// > A server that is attempting to gracefully shut down a connection
// > SHOULD send an initial GOAWAY frame with the last stream
// > identifier set to 2^31-1 and a NO_ERROR code. This signals to the
// > client that a shutdown is imminent and that initiating further
// > requests is prohibited. After allowing time for any in-flight
// > stream creation (at least one round-trip time), the server can
// > send another GOAWAY frame with an updated last stream identifier.
// > This ensures that a connection can be cleanly shut down without
// > losing requests.
self.inner.as_dyn().go_away(StreamId::MAX, Reason::NO_ERROR);

This results in a GOAWAY frame from the server to the client with HTTP/2 last stream id 2147483647 (2^31 - 1) and code 0 (NO_ERROR). The next line that executes sets up the mechanism to stop accepting work: it sends a ping and then stops accepting streams once the pong that follows the GOAWAY is received.

// We take the advice of waiting 1 RTT literally, and wait
// for a pong before proceeding.
self.inner.ping_pong.ping_shutdown();

There is a drain that accompanies this and keeps the connection alive while the old, in-progress streams are serviced. What I'm not clear on from the logs is which requests the client tried to send on the older connection. I'm also curious how other SDKs handle this in a way that doesn't result in CANCELLED errors.
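
For comparison, here is the same sequence sketched with Node's built-in http2 module (purely illustrative; our actual server is tonic/h2, and the bookkeeping for the final last stream id is elided):

import * as http2 from 'node:http2';

// Sketch of the RFC 7540 graceful shutdown quoted above, in Node http2 terms.
function gracefulGoaway(session: http2.ServerHttp2Session): void {
  // 1. Advisory GOAWAY: last stream id 2^31 - 1, code NO_ERROR.
  session.goaway(http2.constants.NGHTTP2_NO_ERROR, 2 ** 31 - 1);

  // 2. Wait roughly one round trip by waiting for a PING ack, so streams
  //    the client already started can still be accepted.
  session.ping(() => {
    // 3. Drain: close() stops new streams and lets existing ones finish.
    session.close();
  });
}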

murgatroid99 commented 8 months ago

I was assuming your server was one of the gRPC implementations. If it isn't, then I can't say anything about what the server is doing. What we have are the client logs, and the client logs say that the client received a GOAWAY with code 8.

pratik151192 commented 8 months ago

I should have been more specific: we use tonic on the server, which is a gRPC implementation (https://github.com/hyperium/tonic). Do you have insight on (a) retrying or handling these errors, and (b) the difference between the older versions of grpc-js and the other SDKs, which are not throwing these errors on every deployment? I can also confirm that the server sends a GOAWAY with code 0, as I see the log line below on some of the runs, but not every time (and not on the ones I sent above):

D 2024-03-13T21:58:37.363Z | v1.10.0 49477 | transport | (16) 35.82.79.231:443 connection closed by GOAWAY with code 0 and data undefined

murgatroid99 commented 8 months ago

I also should have been more specific. Tonic is an implementation of gRPC, but it is not one of the official gRPC implementations that we maintain/support.

I understand that the server sends a GOAWAY with code 0. I am saying that it also sends a GOAWAY with code 8.

Cancellation errors cannot be retried in general, because the client doesn't know how much of the request the server processed.

There isn't enough information in the logs to determine why you see different behavior in other libraries and older versions of Node gRPC. We see a small snippet of what the client sees, and we see that it gets a GOAWAY with code 8 and acts accordingly.
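
As a side note, for the statuses that are safe to retry (such as UNAVAILABLE), grpc-js also supports configurable retries through a service config channel option. A sketch with a hypothetical service name and example retry parameters (CANCELLED deliberately excluded, per the above):

import * as grpc from '@grpc/grpc-js';

// Hypothetical service name; the retry parameters are example values.
const serviceConfig = {
  methodConfig: [
    {
      name: [{ service: 'example.CacheService' }],
      retryPolicy: {
        maxAttempts: 3,
        initialBackoff: '0.1s',
        maxBackoff: '1s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
};

// Pass these as channel options when constructing the generated client.
const channelOptions: grpc.ChannelOptions = {
  'grpc.service_config': JSON.stringify(serviceConfig),
  'grpc.enable_retries': 1,
};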

murgatroid99 commented 7 months ago

In case the problem you are experiencing is related to #2625, I recommend testing again with grpc-js 1.10.4.

pratik151192 commented 7 months ago

I tried with that version but still see the CANCELLED errors, so it doesn't seem entirely related to that issue.