pratik151192 opened 8 months ago
@murgatroid99 Hey Michael, apologies in advance for tagging you explicitly, but I'm doing so as you had cut the ticket to node as well for the `New streams` error.
You don't need to tag me. I get notified for every issue anyway.
The latest version is 1.10.3. Please verify that the issue still happens on that version and include updated trace logs.
Thanks! Here are the logs:

- A case on 1.10.3 where we see `cancelled` errors as well as `new streams cannot be created`
- Another case on 1.10.3 where we only see `cancelled` errors, without the other `new streams` error
The first log shows the "new streams cannot be created" error getting transparently retried. That is exactly the intended behavior, and the error is not surfaced to the application, so that shouldn't be a problem.
As for the "cancelled" errors, the logs show that the connection received a GOAWAY with code 8 right before those errors were surfaced. That is also the expected behavior when the client receives a GOAWAY like that. You said in the first post on this issue
Our servers are configured to send a
GOAWAY
frame to clients during deployments to gracefully terminate connections.
You don't give any details about this configuration, but it is common for servers gracefully closing connections to send a GOAWAY with code 0 to signal that no new streams should be opened, then wait some period of time (often referred to as a "grace period"), and then send a GOAWAY with code 8 to cancel all streams that are still open after that time. What we are seeing in this log is the expected outcome of that behavior.
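For illustration, that double-GOAWAY sequence looks roughly like this (a minimal sketch using Node's built-in `http2` module; the grace period value is arbitrary and I don't know anything about your actual server configuration):

```ts
import * as http2 from 'http2';

const { NGHTTP2_NO_ERROR, NGHTTP2_CANCEL } = http2.constants;

// Illustrative grace period; real servers pick their own value.
const GRACE_PERIOD_MS = 10_000;

function gracefulShutdown(session: http2.ServerHttp2Session) {
  // 1. GOAWAY with code 0 (NO_ERROR): tells the client not to open
  //    new streams on this connection; existing streams may finish.
  session.goaway(NGHTTP2_NO_ERROR);

  // 2. After the grace period, GOAWAY with code 8 (CANCEL): streams
  //    that are still open are cancelled, which a gRPC client surfaces
  //    as CANCELLED errors on those calls.
  setTimeout(() => {
    session.goaway(NGHTTP2_CANCEL);
    session.close();
  }, GRACE_PERIOD_MS);
}
```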
Thanks! Are these `cancelled` errors safe to retry? I'm also curious what changed in the gRPC versions between 1.7.3 and 1.8.14, as I can consistently reproduce the errors surfacing since that version upgrade, and no errors before it. The problem is that our clients use non-idempotent APIs we provide, such as `increment`, which we only retry for `unavailable` and `internal` as those are safe. I am not sure if we can deem these as safe to retry? And if not, I'm curious if you have thoughts on handling these situations, because currently our clients experience these errors every time we have a service deployment.
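For context, the kind of retry policy I mean looks roughly like this (a simplified sketch; the service name and backoff values are placeholders, not our real config):

```ts
import * as grpc from '@grpc/grpc-js';

// Sketch of a service config that retries only status codes we
// consider safe for non-idempotent calls.
const serviceConfig = {
  methodConfig: [
    {
      name: [{ service: 'my.package.MyService' }], // placeholder service name
      retryPolicy: {
        maxAttempts: 3,
        initialBackoff: '0.1s',
        maxBackoff: '1s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE', 'INTERNAL'],
      },
    },
  ],
};

const channelOptions: grpc.ChannelOptions = {
  'grpc.service_config': JSON.stringify(serviceConfig),
  'grpc.enable_retries': 1,
};
```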
This is from the framework that facilitates our server-side shutdown:
```rust
// According to http://httpwg.org/specs/rfc7540.html#GOAWAY:
//
// > A server that is attempting to gracefully shut down a connection
// > SHOULD send an initial GOAWAY frame with the last stream
// > identifier set to 2^31-1 and a NO_ERROR code. This signals to the
// > client that a shutdown is imminent and that initiating further
// > requests is prohibited. After allowing time for any in-flight
// > stream creation (at least one round-trip time), the server can
// > send another GOAWAY frame with an updated last stream identifier.
// > This ensures that a connection can be cleanly shut down without
// > losing requests.
self.inner.as_dyn().go_away(StreamId::MAX, Reason::NO_ERROR);
```
This results in a GOAWAY frame from the server to the client with HTTP/2 stream ID 2147483647 and code 0. The next line that executes sets up the mechanism to stop accepting work: it sends a ping and then stops accepting streams pending receipt of the pong that follows the GOAWAY.
```rust
// We take the advice of waiting 1 RTT literally, and wait
// for a pong before proceeding.
self.inner.ping_pong.ping_shutdown();
```
There is a drain that accompanies this and keeps the connection alive while the old streams that are in progress are serviced. What I'm not clear on from the logs is which requests the client tried to send on the older connection. I am also curious how other SDKs handle this in a way that doesn't result in `cancelled`.
I was assuming your server was one of the gRPC implementations. If it isn't, then I can't say anything about what the server is doing. What we have are the client logs, and the client logs say that the client received a GOAWAY with code 8.
I should have been more specific - we do use tonic on the server, which is a gRPC implementation (https://github.com/hyperium/tonic). Do you have insight on (a) retrying or handling these errors, and (b) the difference between the older versions of grpc-js and other SDKs, which are not throwing these errors on every deployment? I can also confirm the server sends a GOAWAY with code 0, as I see the below log on some of the runs, but not every time (and not on the ones I sent above):
```
D 2024-03-13T21:58:37.363Z | v1.10.0 49477 | transport | (16) 35.82.79.231:443 connection closed by GOAWAY with code 0 and data undefined
```
I also should have been more specific. Tonic is an implementation of gRPC, but it is not one of the official gRPC implementations that we maintain/support.
I understand that the server sends a GOAWAY with code 0. I am saying that it also sends a GOAWAY with code 8.
Cancellation errors cannot be retried in general, because the client doesn't know how much of the request the server processed.
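In other words, whether a `CANCELLED` call can safely be re-issued is an application-level decision, along these lines (a sketch; `isIdempotent` is a hypothetical per-method flag, not an existing grpc-js feature):

```ts
import * as grpc from '@grpc/grpc-js';

// Sketch: only the application knows whether re-executing a call is harmless.
function shouldRetry(err: grpc.ServiceError, isIdempotent: boolean): boolean {
  switch (err.code) {
    case grpc.status.UNAVAILABLE:
    case grpc.status.INTERNAL:
      // Treated as safe to retry in this application (see discussion above).
      return true;
    case grpc.status.CANCELLED:
      // The server may have partially or fully processed the request,
      // so only retry if the operation is idempotent.
      return isIdempotent;
    default:
      return false;
  }
}
```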
There isn't enough information in the logs to determine why you see different behavior in other libraries and older versions of Node gRPC. We see a small snippet of what the client sees, and we see that it gets a GOAWAY with code 8 and acts accordingly.
In case the problem you are experiencing is related to #2625, I recommend testing again with grpc-js 1.10.4.
I tried with that but still see the `cancelled` errors, so it's not totally related to that issue.
Problem description
During our service deployments, the client is rejecting a certain number of requests with `CANCELLED` errors. Our servers are configured to send a `GOAWAY` frame to clients during deployments to gracefully terminate connections. On May 11, 2023, we bumped our gRPC-js version from `1.7.3` to `1.8.14`. We reproduced our deployment scenario on multiple versions of our SDK, and the aforementioned gRPC version bump is the one where we start seeing the errors during deployments. We are not seeing this behavior with other SDKs.

I haven't bisected the gRPC-js versions in between those two, and I see there are ~3k commits between them, so it is hard to pinpoint a specific commit. I also see an open issue related to "New streams cannot be created after receiving a GOAWAY", but that's only one part of the problem we witness. There's another ticket that seemed relevant.

In our logs, we see two flavors of errors. One is when the `transport` layer of gRPC detects a `GOAWAY` frame and logs it, but is still followed by a few `CANCELLED` errors. The other flavor is when the `transport` layer doesn't detect the frame explicitly, but we see `new stream cannot be created after GOAWAY` messages, followed by more cancelled errors. Sometimes we see both of these behaviors.

Here's a logStream from when we see both of them, with HTTP and gRPC traces:
with_cancelled_new_stream_errors.log
The below is a logStream where we don't see any `cancelled` or `new stream` errors, from a version of gRPC earlier than `1.8.14`. Notice that these runs do report `INTERNAL` errors sometimes (not as frequently or as many as `cancelled`), but even those can be safely retried, so that isn't an issue. If there's a way we can know whether the `cancelled` errors are safe to retry for non-idempotent operations, that would be awesome.

without_cancelled_new_stream_errors_old_grpc_version.log
Let me know if you need more logs! I have attached them as files so as not to clutter this space. I can easily reproduce this in my setup, so I'll be happy to add any more traces/scenarios that you'd like!
Reproduction steps
Environment
1.8.14