grpc / grpc-swift

The Swift language implementation of gRPC.
Apache License 2.0

Encounter: unavailable (14) Transport became inactive #1635

Closed: xkam1x closed this 1 year ago

xkam1x commented 1 year ago

Describe the bug

Unexpectedly running into 'unavailable (14) Transport became inactive' while making RPCs. Once it happens, subsequent RPCs can also fail, but the system recovers in under 1 minute.
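As a stopgap while investigating, transient `unavailable (14)` failures like this are commonly wrapped in a client-side retry. A minimal sketch in plain Swift (the `TransportError` type and the call body are hypothetical stand-ins; with grpc-swift you would instead check `GRPCStatus.code == .unavailable`):

```swift
// Hypothetical stand-in for a transient transport failure;
// not a grpc-swift type.
enum TransportError: Error {
    case unavailable
}

// Retry a throwing operation up to `maxAttempts` times,
// rethrowing the last error if every attempt fails.
func withRetries<T>(_ maxAttempts: Int, _ body: () throws -> T) throws -> T {
    var lastError: Error = TransportError.unavailable
    for _ in 0..<maxAttempts {
        do {
            return try body()
        } catch {
            lastError = error
        }
    }
    throw lastError
}

// Example: an operation that fails twice, then succeeds.
var attempts = 0
do {
    let value = try withRetries(3) { () -> String in
        attempts += 1
        guard attempts == 3 else { throw TransportError.unavailable }
        return "ok"
    }
    print(value)  // prints "ok"
} catch {
    print("all attempts failed: \(error)")
}
```

A real retry layer would also add a delay between attempts and only retry idempotent RPCs.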

I am using the following:

grpc-swift 1.18.0
swift-atomics 1.0.2
swift-collections 1.0.3
swift-log 1.4.4
swift-nio 2.43.1
swift-nio-extras 1.8.0
swift-nio-http2 1.27.0
swift-nio-ssl 2.23.1
swift-nio-transport-services 1.16.0
SwiftProtobuf 1.22.0

My setup is APP -> NGINX -> GRPC C++ Server.
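For reference, a gRPC pass-through in NGINX typically looks like the following (a sketch only; the listen port and upstream address are placeholders for this setup, and `grpc_pass` requires NGINX 1.13.10 or later):

```nginx
server {
    listen 8080 http2;                     # gRPC requires HTTP/2
    location / {
        grpc_pass grpc://127.0.0.1:50051;  # upstream gRPC C++ server
    }
}
```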

To reproduce

I am using the following channel config:

    let eventLoopGroup = PlatformSupport.makeEventLoopGroup(loopCount: 1)
    var configuration = GRPCChannelPool.Configuration.with(
        target: .hostAndPort(AppDefaults.shared.serverAddress, 8080),
        transportSecurity: .plaintext,
        eventLoopGroup: eventLoopGroup
    )
    configuration.connectionBackoff = .init(
        initialBackoff: 1.0,
        maximumBackoff: 5.0,
        multiplier: 1.6,
        jitter: 0.2,
        minimumConnectionTimeout: 5.0,
        retries: .unlimited
    )
    configuration.idleTimeout = .minutes(1)
    configuration.keepalive = .init(
        interval: .seconds(2),
        timeout: .seconds(1),
        permitWithoutCalls: false,
        maximumPingsWithoutData: 2,
        minimumSentPingIntervalWithoutData: .seconds(5)
    )
    configuration.maximumReceiveMessageLength = 16 * 1024 * 1024

    var log = Logger(label: "client")
    log.logLevel = .trace
    configuration.backgroundActivityLogger = log
    channel = try! GRPCChannelPool.with(configuration: configuration)
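The backoff parameters above imply a reconnect schedule that grows geometrically from 1 s and is capped at 5 s. A small sketch of that schedule in plain Swift (jitter of ±20% is omitted for determinism; this mirrors the general gRPC backoff algorithm, not grpc-swift's exact internals):

```swift
// Compute the first `attempts` reconnect delays for an exponential
// backoff with a cap, ignoring jitter.
func backoffSchedule(initial: Double, multiplier: Double,
                     maximum: Double, attempts: Int) -> [Double] {
    var delay = initial
    var schedule: [Double] = []
    for _ in 0..<attempts {
        schedule.append(min(delay, maximum))
        delay *= multiplier
    }
    return schedule
}

// With the configuration above: roughly 1.0, 1.6, 2.56, 4.1, then 5.0 capped.
let schedule = backoffSchedule(initial: 1.0, multiplier: 1.6,
                               maximum: 5.0, attempts: 5)
print(schedule)
```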

Additional information

When 'Transport became inactive' occurs, the NGINX logs report an HTTP/2 status of 499 (client closed the connection).

Logs just before RPC failure:

2023-07-26T12:40:04+0100 trace call : call_state=awaitingTransport (0 parts buffered) grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=metadata [GRPC] buffering request part
2023-07-26T12:40:04+0100 trace call : call_state=awaitingTransport (1 parts buffered) grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=message [GRPC] buffering request part
2023-07-26T12:40:04+0100 trace call : call_state=awaitingTransport (2 parts buffered) grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=end [GRPC] buffering request part
2023-07-26T12:40:04+0100 debug call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 [GRPC] activated stream channel
2023-07-26T12:40:04+0100 trace call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_parts=3 [GRPC] unbuffering request parts
2023-07-26T12:40:04+0100 trace call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=metadata [GRPC] unbuffering request part
2023-07-26T12:40:04+0100 trace call : grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 h2_end_stream=false h2_headers=[([], ":method", "POST"), ([], ":path", "/rias_app_data_api.AppDataService/GetSentences"), ([], ":authority", "192.168.1.140"), ([], ":scheme", "http"), ([], "content-type", "application/grpc"), ([], "te", "trailers"), ([], "grpc-timeout", "4999991u"), ([], "x-token", "e48e8bd602294de9ad671a57bd3e2c70"), ([], "user-agent", "grpc-swift-nio/1.18.0")] h2_payload=HEADERS [GRPC] writing HTTP2 frame
2023-07-26T12:40:04+0100 trace call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=message [GRPC] unbuffering request part
2023-07-26T12:40:04+0100 trace call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 request_part=end [GRPC] unbuffering request part
2023-07-26T12:40:04+0100 trace call : grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 h2_data_bytes=2026 h2_end_stream=false h2_payload=DATA [GRPC] writing HTTP2 frame
2023-07-26T12:40:04+0100 trace call : grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 h2_data_bytes=0 h2_end_stream=true h2_payload=DATA [GRPC] writing HTTP2 frame
2023-07-26T12:40:04+0100 debug client : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_connection_id=C9CEAB2D-F819-4427-9A7C-DFF5805E7599/0 h2_active_streams=1 h2_stream_id=HTTP2StreamID(297) poolmanager.id=ObjectIdentifier(0x00006000002cdac0) [GRPC] HTTP2 stream created
2023-07-26T12:40:04+0100 trace call : grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 [GRPC] request buffer drained
2023-07-26T12:40:04+0100 trace call : call_state=closed grpc.conn.addr_local=192.168.1.180 grpc.conn.addr_remote=192.168.1.140 grpc_request_id=B2FF7070-F833-418F-987E-C58FA90C0AA6 [GRPC] failing buffered writes
2023-07-26T12:40:04+0100 debug client : connectivity_state=ready grpc_connection_id=C9CEAB2D-F819-4427-9A7C-DFF5805E7599/0 poolmanager.id=ObjectIdentifier(0x00006000002cdac0) [GRPC] deactivating connection
glbrntt commented 1 year ago

Interesting. We seem to be dropping the connection for some reason. Given the "client dropped connection" status, I wonder if this is https://github.com/grpc/grpc-swift/pull/1626

Are you able to reproduce this easily? If so, could you try building against `main`?
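Building against `main` can be done by pointing the package dependency at the branch in `Package.swift` (a sketch; revert to a tagged version once the fix is released):

```swift
dependencies: [
    // Track the main branch instead of a tagged release:
    .package(url: "https://github.com/grpc/grpc-swift.git", branch: "main"),
]
```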

xkam1x commented 1 year ago

After compiling against the main branch, I ran a stress test: just under 50k RPCs transferring just over 10 GB of data. No errors were logged during the test, whereas previously the error appeared roughly once a minute, so the issue appears to be fixed.

I'm looking forward to the next release incorporating this change. Thanks to the team for identifying and resolving the issue.

glbrntt commented 1 year ago

Awesome, thanks for confirming so quickly! We'll try to get the fix out soon :)

glbrntt commented 1 year ago

Sorry this took a little longer than hoped to get released, but it's now available in 1.19.0.