Closed i9 closed 4 years ago
The underlying issue is in swift-nio-http2, as I discussed in apple/swift-nio-http2#213. However, the fix there is complex, so we should consider whether we can work around the issue in grpc.
There are two possible workarounds that I can see. The first is to add a queue in gRPC to ensure that we flush data in the order we create the streams. This is annoying, but it can work. The second is even easier: just treat StreamIDTooSmall errors as trivially retryable, and always retry when we hit them.
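A minimal sketch of that second workaround, assuming we could catch the error per call. `StreamIDTooSmall` below is a local stand-in for the NIOHTTP2 error of the same name, and `withStreamIDRetry` is a hypothetical helper, not part of the gRPC API:

```swift
// Local stand-in for NIOHTTP2Errors.StreamIDTooSmall (hypothetical sketch).
struct StreamIDTooSmall: Error {}

// Retry `body` up to `maxAttempts` times, but only when the failure is
// StreamIDTooSmall; any other error propagates immediately. A fresh attempt
// would open a fresh stream, so the retry is "trivial".
func withStreamIDRetry<T>(maxAttempts: Int, _ body: () throws -> T) throws -> T {
    var attempt = 1
    while true {
        do {
            return try body()
        } catch is StreamIDTooSmall where attempt < maxAttempts {
            attempt += 1
        }
    }
}
```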
@glbrntt, thoughts?
@Lukasa I like the simplicity of just retrying, but StreamIDTooSmall is thrown at the connection level, and I don't think there's currently a way to recover that information at the stream level (since the multiplexer doesn't route errors into the stream channel).
Unless I'm missing something we'd have to make changes to NIOHTTP2 (namely: add the streamID to the error and route that error into the relevant stream via the multiplexer) to allow for this. It'd at least enable other HTTP/2 users to work around this in the same way.
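To illustrate the shape of the change being suggested, here is a purely hypothetical sketch. NIOHTTP2 does not expose anything like this today; the types and routing function are invented for illustration:

```swift
// Hypothetical: a connection-level error annotated with the stream it
// belongs to, so a multiplexer could deliver it to the right child channel.
struct StreamScopedError: Error {
    let streamID: Int32     // HTTP/2 stream the error belongs to
    let underlying: Error   // e.g. StreamIDTooSmall
}

// Toy "multiplexer" dispatch: route the error to the handler registered
// for its stream ID, if any. Returns whether it was delivered.
func route(_ error: StreamScopedError,
           handlers: [Int32: (Error) -> Void]) -> Bool {
    guard let handler = handlers[error.streamID] else { return false }
    handler(error.underlying)
    return true
}
```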
Hmm, yeah, that's gonna be trickier. Let's chat about this today.
@i9 just to provide an update: we decided to fix this in NIOHTTP2 rather than working around it in gRPC. The issue tracking this is: https://github.com/apple/swift-nio-http2/issues/214
really appreciate the swift action 😉
This bug is easy to reproduce on an iOS device (on the Simulator it happens very rarely) using NIOTSEventLoopGroup.
I launched two simple RPC calls one after another.
The first returns with a result; the second freezes, and after some amount of time its future fails with an error:
2020-07-31 10:52:02.075090+0400 [] tcp_input [C2:1] flags=[R.] seq=3741818499, ack=2511433662, win=2056 state=ESTABLISHED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.167720+0400 [] tcp_output [C2:1] flags=[R.] seq=2511433662, ack=3741818499, win=1028 state=CLOSED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.169659+0400 [] nw_read_request_report [C2] Receive failed with error "Operation timed out"
2020-07-31 10:52:04.170263+0400 [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0
2020-07-31 10:52:04.171069+0400 [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0, dumping backtrace:
[arm64] libnetcore-1880.120.4
0 libnetwork.dylib 0x000000018eb68e94 __nw_create_backtrace_string + 116
1 libusrtcp.dylib 0x000000019184de40 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 273984
2 libusrtcp.dylib 0x000000019184cfe8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 270312
3 libusrtcp.dylib 0x000000019184c2f8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 267000
4 libusrtcp.dylib 0x0000000191851438 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 287800
5 libusrtcp.dylib 0x000000019182aa58 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 129624
6 libdispatch.dylib 0x0000000108986338 _dispatch_call_block_and_release + 24
7 libdispatch.dylib 0x0000000108987730 _dispatch_client_callout + 16
8 libdispatch.dylib 0x000000010899081c _dispatch_workloop_invoke + 2480
9 libdispatch.dylib 0x000000010899a6c4 _dispatch_workloop_worker_thread + 1304
10 libsystem_pthread.dylib 0x000000018c21cb74 _pthread_wqthread + 272
11 libsystem_pthread.dylib 0x000000018c21f740 start_wqthread + 8
2020-07-31T10:52:04+0400 error: error=POSIXErrorCode: Operation timed out connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 grpc client error
2020-07-31T10:52:04+0400 info: old_state=ready connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 new_state=transientFailure connectivity state change
2020-07-31T10:52:04+0400 error: request_id=D83D0506-D423-450D-B1BD-E36F04BEEE2F connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 error=unavailable (14) call_state=active grpc client error
@givip There is no evidence in that log of this error occurring. Is there any evidence elsewhere?
@Lukasa What would you consider evidence? I thought the last 3 lines of the log indicate the same error @i9 described. Maybe an example project with minimal code to reproduce is needed?
This log:
2020-07-27T15:19:49+0800 error: connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=StreamIDTooSmall() grpc client error
Note the presence of the string StreamIDTooSmall() in the log, which marks it as this error.
@Lukasa I don't see this error mentioned in my logs, but the behaviour seems to be the same bug: I start two simultaneous requests and the second never completes; after some amount of time it fails with an error.
The symptom may be the same but without that log it cannot be this issue.
@Lukasa My issue is exactly #901. Sorry!
Describe the bug
grpc-swift version 1.0.0-alpha.17, swift-nio-http2 from: "1.12.1"
grpc-swift client(iOS App) --- nginx --- grpc server
The client sends 3 different RPCs almost at the same time; one failed due to StreamIDTooSmall and the other two failed due to unavailable. Error log:
But from the nginx log, 2 out of 3 responses were returned successfully. One had an error:
client prematurely closed connection while processing HTTP/2 connection
which we assume is due to a client-side close and reconnect. We suspect that in swift-nio-http2's createStreamChannel,
self.nextOutboundStreamID = HTTP2StreamID(Int32(streamID) + 2)
isn't thread safe, but we couldn't find any docs or guidance on what to do.

To reproduce
Steps to reproduce the bug you've found:
Expected behaviour
All requests succeeded.
Additional information
About 15% of our users are seeing the error. If it's the API caller's responsibility to avoid concurrent requests, we would like the docs to clearly mention that.
I also filed https://github.com/apple/swift-nio-http2/issues/213 but unsure which repo will fix the issue (if it's indeed a bug)
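For context on why an unsynchronized `nextOutboundStreamID += 2` would be unsafe, here is a sketch of the classic fix: serializing the read-modify-write on a single queue. This is only an illustration; SwiftNIO actually avoids the race differently, by confining channel state to the event loop, and `StreamIDAllocator` is an invented name:

```swift
import Dispatch

// Sketch: stream-ID allocation made race-free by serializing the
// read-modify-write on one queue. Without this, two threads could read
// the same nextID and hand out duplicate (or out-of-order) stream IDs.
final class StreamIDAllocator {
    private var nextID: Int32
    private let queue = DispatchQueue(label: "stream-id-allocator")

    init(first: Int32 = 1) { self.nextID = first }

    // Client-initiated HTTP/2 streams use odd IDs: 1, 3, 5, ...
    func next() -> Int32 {
        return queue.sync {
            defer { nextID += 2 }
            return nextID
        }
    }
}
```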