grpc / grpc-swift

The Swift language implementation of gRPC.

concurrent requests cause StreamIDTooSmall error, channel state becomes transientFailure #912

Closed: i9 closed this issue 4 years ago

i9 commented 4 years ago

Describe the bug

grpc-swift version 1.0.0-alpha.17, swift-nio-http2 from: "1.12.1"

grpc-swift client(iOS App) --- nginx --- grpc server

The client sends 3 different RPCs at almost the same time; one fails with StreamIDTooSmall and the other two fail with unavailable. Error log:

2020-07-27T15:19:49+0800 error: connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=StreamIDTooSmall() grpc client error
2020-07-27T15:19:49+0800 info: new_state=transientFailure old_state=ready connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 connectivity state change
"gRPC state did change from ready to transientFailure"
2020-07-27T15:19:49+0800 error: request_id=9168A18B-3ED8-43F2-BCF5-188C1F68FF9A call_state=active connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=unavailable (14) grpc client error
2020-07-27T15:19:49+0800 error: call_state=active request_id=517D243A-8A9B-413C-97E9-7078E3866614 connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=unavailable (14) grpc client error

However, the nginx log shows that 2 of the 3 responses were returned successfully. The third failed with "client prematurely closed connection while processing HTTP/2 connection", which we assume is the client side closing and reconnecting.

We suspect that self.nextOutboundStreamID = HTTP2StreamID(Int32(streamID) + 2) in swift-nio-http2's createStreamChannel isn't thread safe, but we couldn't find any documentation or guidance on what to do.
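
For illustration only (this is not the swift-nio-http2 source), the pattern we suspect looks roughly like the sketch below: a plain read-modify-write on the next outbound stream ID, while HTTP/2 requires client stream IDs to be odd and strictly increasing on the wire.

```swift
// Illustrative sketch of the suspected problem, not actual NIOHTTP2 code.
final class NaiveStreamIDAllocator {
    private var nextOutboundStreamID: Int32 = 1

    // Not safe to call from concurrent contexts: two callers can observe
    // the same value, and nothing orders the resulting writes.
    func allocateStreamID() -> Int32 {
        let streamID = nextOutboundStreamID
        nextOutboundStreamID = streamID + 2  // next odd (client-initiated) ID
        return streamID
    }
}

// If stream A is assigned ID 3 and stream B ID 5, but B's HEADERS frame is
// flushed to the connection before A's, the connection sees 5 then 3 and
// rejects the second write with StreamIDTooSmall.
```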

To reproduce

Steps to reproduce the bug you've found:

  1. Send concurrent gRPC requests (see the sketch below).
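
A minimal sketch of the kind of client code that triggers it. The generated Echo_EchoClient, its get RPC, the request type, and the target are placeholders, and the exact ClientConnection and generated-client initializers differ slightly between 1.0.0-alpha releases:

```swift
import GRPC
import NIO

// Sketch only: service/client names and the target are placeholders.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
defer { try? group.syncShutdownGracefully() }

let connection = ClientConnection(
    configuration: .init(target: .hostAndPort("example.com", 443),
                         eventLoopGroup: group)
)
let client = Echo_EchoClient(channel: connection)

// Fire three RPCs without waiting for each other. Intermittently one of
// them fails with StreamIDTooSmall, the connection drops to
// transientFailure, and the remaining calls fail as unavailable.
let calls = (0..<3).map { i in
    client.get(Echo_EchoRequest.with { $0.text = "request \(i)" })
}
for call in calls {
    call.response.whenComplete { result in
        print(result)
    }
}
```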

Expected behaviour

All requests succeed.

Additional information

About 15% of our users are seeing the error. If it's the API caller's responsibility to avoid concurrent requests, we would like the documentation to say so clearly.

I also filed https://github.com/apple/swift-nio-http2/issues/213, but I'm unsure which repo will fix the issue (if it's indeed a bug).

Lukasa commented 4 years ago

The underlying issue is in swift-nio-http2, as I discussed in apple/swift-nio-http2#213. However, the fix there is complex, so we should consider whether we can work around the issue in grpc.

There are two possible workarounds that I can see. The first is to apply a queue in grpc to ensure that we flush data in the order we create the streams. This is annoying, but can work. The second is even easier: just consider StreamIDTooSmall errors to be trivially retryable, and to always retry when we hit them.
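
For illustration, the retry idea at its simplest is an application-level wrapper like the hypothetical helper below (names are made up; this is not something grpc-swift ships, and as written it retries on any failure because the StreamIDTooSmall detail isn't visible to the caller today):

```swift
import NIO

// Hypothetical helper: re-run a call whose response future failed,
// up to a bounded number of attempts.
func withRetries<Response>(
    _ attempts: Int,
    on eventLoop: EventLoop,
    _ makeCall: @escaping () -> EventLoopFuture<Response>
) -> EventLoopFuture<Response> {
    makeCall().flatMapError { error in
        guard attempts > 1 else {
            return eventLoop.makeFailedFuture(error)
        }
        return withRetries(attempts - 1, on: eventLoop, makeCall)
    }
}

// Usage (names hypothetical):
// let response = withRetries(3, on: group.next()) { client.get(request).response }
```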

@glbrntt, thoughts?

glbrntt commented 4 years ago

@Lukasa I like the simplicity of just retrying, but StreamIDTooSmall is thrown at the connection level and I don't think there's currently a way to recover that information at the stream level (since the multiplexer doesn't route errors into the stream channel).

Unless I'm missing something we'd have to make changes to NIOHTTP2 (namely: add the streamID to the error and route that error into the relevant stream via the multiplexer) to allow for this. It'd at least enable other HTTP/2 users to work around this in the same way.
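
Very roughly, the shape of the change would be something like the sketch below. None of this exists in NIOHTTP2 today; the error type and handler are hypothetical:

```swift
import NIO
import NIOHTTP2

// Hypothetical: attach the stream ID to the error so the multiplexer can
// fire it into the matching stream channel.
struct StreamSpecificError: Error {
    var streamID: HTTP2StreamID
    var baseError: Error
}

// Hypothetical handler living in the stream channel's pipeline.
final class StreamErrorObserver: ChannelInboundHandler {
    typealias InboundIn = HTTP2Frame

    func errorCaught(context: ChannelHandlerContext, error: Error) {
        if let streamError = error as? StreamSpecificError,
           streamError.baseError is NIOHTTP2Errors.StreamIDTooSmall {
            // Here gRPC (or any other HTTP/2 user) could mark the RPC as
            // safely retryable instead of surfacing `unavailable`.
        }
        context.fireErrorCaught(error)
    }
}
```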

Lukasa commented 4 years ago

Hmm, yeah, that's gonna be trickier. Let's chat about this today.

glbrntt commented 4 years ago

@i9 just to provide an update: we decided to fix this in NIOHTTP2 rather than working around it in gRPC. The issue tracking this is: https://github.com/apple/swift-nio-http2/issues/214

i9 commented 4 years ago

> @i9 just to provide an update: we decided to fix this in NIOHTTP2 rather than working around it in gRPC. The issue tracking this is: apple/swift-nio-http2#214

really appreciate the swift action 😉

givip commented 4 years ago

This bug is easy to reproduce on an iOS device (on the Simulator it happens very rarely) using NIOTSEventLoopGroup. I launched two simple RPC calls one after another: the first returns a result, the second freezes and after some time fails its future with an error:

2020-07-31 10:52:02.075090+0400  [] tcp_input [C2:1] flags=[R.] seq=3741818499, ack=2511433662, win=2056 state=ESTABLISHED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.167720+0400  [] tcp_output [C2:1] flags=[R.] seq=2511433662, ack=3741818499, win=1028 state=CLOSED rcv_nxt=3741818499, snd_una=2511433662
2020-07-31 10:52:04.169659+0400  [] nw_read_request_report [C2] Receive failed with error "Operation timed out"
2020-07-31 10:52:04.170263+0400  [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0
2020-07-31 10:52:04.171069+0400  [] sb_empty_assert [C2:1] sbflush: sb 0x1090369d0 so 0x109036850 cc 0 mb 0x1090369e0, dumping backtrace:
        [arm64] libnetcore-1880.120.4
    0   libnetwork.dylib                    0x000000018eb68e94 __nw_create_backtrace_string + 116
    1   libusrtcp.dylib                     0x000000019184de40 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 273984
    2   libusrtcp.dylib                     0x000000019184cfe8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 270312
    3   libusrtcp.dylib                     0x000000019184c2f8 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 267000
    4   libusrtcp.dylib                     0x0000000191851438 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 287800
    5   libusrtcp.dylib                     0x000000019182aa58 049DC0B2-A42A-3EE3-B536-597E51BBA17C + 129624
    6   libdispatch.dylib                   0x0000000108986338 _dispatch_call_block_and_release + 24
    7   libdispatch.dylib                   0x0000000108987730 _dispatch_client_callout + 16
    8   libdispatch.dylib                   0x000000010899081c _dispatch_workloop_invoke + 2480
    9   libdispatch.dylib                   0x000000010899a6c4 _dispatch_workloop_worker_thread + 1304
    10  libsystem_pthread.dylib             0x000000018c21cb74 _pthread_wqthread + 272
    11  libsystem_pthread.dylib             0x000000018c21f740 start_wqthread + 8
2020-07-31T10:52:04+0400 error: error=POSIXErrorCode: Operation timed out connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 grpc client error
2020-07-31T10:52:04+0400 info: old_state=ready connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 new_state=transientFailure connectivity state change
2020-07-31T10:52:04+0400 error: request_id=D83D0506-D423-450D-B1BD-E36F04BEEE2F connection_id=F5DA533B-CAF8-4453-B2AC-81F97D161815/0 error=unavailable (14) call_state=active grpc client error

Lukasa commented 4 years ago

@givip There is no evidence in that log of this error occurring. Is there any evidence elsewhere?

givip commented 4 years ago

@Lukasa What do you consider as evidence? I thought the last 3 lines of the log indicate the same error that @i9 described. Would an example project with minimal code to reproduce it help?

Lukasa commented 4 years ago

This log:

2020-07-27T15:19:49+0800 error: connection_id=4003FC27-24A7-4B11-B041-37FDCA5E35C1/0 error=StreamIDTooSmall() grpc client error

Note the presence of the string StreamIDTooSmall() in the log, which marks it as this error.

givip commented 4 years ago

@Lukasa I don't see that error mentioned, but the buggy behaviour seems the same. I start two simultaneous requests and the second never completes; after some time it fails with an error.

Lukasa commented 4 years ago

The symptom may be the same, but without that log it cannot be this issue.

givip commented 4 years ago

@Lukasa My issue is exactly #901. Sorry.