grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

"io.grpc.StatusRuntimeException: UNKNOWN: channel closed" on network disconnect #10120

Open cocreature opened 1 year ago

cocreature commented 1 year ago

What version of gRPC-Java are you using?

1.44.0

What is your environment?

Ubuntu 22.04

What did you expect to see?

An UNAVAILABLE status code or something similar

What did you see instead?

We saw things fail with this exception and stacktrace:

io.grpc.StatusRuntimeException: UNKNOWN: channel closed
    at io.grpc.Status.asRuntimeException(Status.java:535)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479)
    at io.opentelemetry.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall$TracingClientCallListener.onClose(TracingClientInterceptor.java:161)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at com.daml.executors.QueueAwareExecutorService$TrackingRunnable.run(QueueAwareExecutorService.scala:98)
    at com.daml.metrics.InstrumentedExecutorServiceMetrics$InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorServiceMetrics.scala:202)
    at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.nio.channels.ClosedChannelException: null
    at io.grpc.netty.Utils.statusFromThrowable(Utils.java:271)
    at io.grpc.netty.NettyClientHandler.onConnectionError(NettyClientHandler.java:500)
    at io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:641)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders0(DefaultHttp2ConnectionEncoder.java:251)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:167)
    at io.netty.handler.codec.http2.DecoratingHttp2FrameWriter.writeHeaders(DecoratingHttp2FrameWriter.java:53)
    at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:170)
    at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:158)
    at io.grpc.netty.NettyClientHandler.createStreamTraced(NettyClientHandler.java:609)
    at io.grpc.netty.NettyClientHandler.createStream(NettyClientHandler.java:592)
    at io.grpc.netty.NettyClientHandler.write(NettyClientHandler.java:326)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
    at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
    at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
    at io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1015)
    at io.netty.channel.AbstractChannel.write(AbstractChannel.java:301)
    at io.grpc.netty.WriteQueue$AbstractQueuedCommand.run(WriteQueue.java:213)
    at io.grpc.netty.WriteQueue.flush(WriteQueue.java:128)
    at io.grpc.netty.WriteQueue.access$000(WriteQueue.java:34)
    at io.grpc.netty.WriteQueue$1.run(WriteQueue.java:46)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

Interestingly, it does look like the channel recovered from this once the connection was established again.

Steps to reproduce the bug

In our test setup, we kill the connection with toxiproxy and then see this failure, but only relatively rarely. Unfortunately, I don't have a reliable reproduction (nor one that I can make public).

Is that expected? Given that it recovers, should we just retry on UNKNOWN: channel closed like we do on UNAVAILABLE?
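
For context, the kind of client-side check being discussed might look like the following minimal sketch; the RetryDecision class and its retryable set are illustrative, not taken from the issue.

import io.grpc.Status;
import java.util.EnumSet;
import java.util.Set;

final class RetryDecision {
  // Codes currently treated as retryable. Whether "UNKNOWN: channel closed"
  // should be added to this set is exactly what this issue is asking.
  private static final Set<Status.Code> RETRYABLE = EnumSet.of(Status.Code.UNAVAILABLE);

  static boolean shouldRetry(Throwable t) {
    // Status.fromThrowable walks the cause chain and extracts the gRPC status.
    return RETRYABLE.contains(Status.fromThrowable(t).getCode());
  }
}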

YifeiZhuang commented 1 year ago

~The exception looks like expected behaviour. The error needs to be handled in your StreamObserver/call listener, and a retry sounds fair. A retry policy in gRPC may help to do that automatically: see the retry example.~
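
For readers following the retry-policy pointer above, here is a minimal sketch of configuring one through the channel's default service config; the service name, target address, and backoff values are placeholders, and this is not presented as the resolution of this issue.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class RetryChannel {
  static ManagedChannel build() {
    // Retry policy in service-config form; numeric values must be JSON numbers (Double).
    Map<String, Object> retryPolicy = new HashMap<>();
    retryPolicy.put("maxAttempts", 4.0);
    retryPolicy.put("initialBackoff", "0.5s");
    retryPolicy.put("maxBackoff", "10s");
    retryPolicy.put("backoffMultiplier", 2.0);
    retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

    Map<String, Object> name = new HashMap<>();
    name.put("service", "my.package.MyService"); // placeholder service name

    Map<String, Object> methodConfig = new HashMap<>();
    methodConfig.put("name", Arrays.asList(name));
    methodConfig.put("retryPolicy", retryPolicy);

    Map<String, Object> serviceConfig = new HashMap<>();
    serviceConfig.put("methodConfig", Arrays.asList(methodConfig));

    return ManagedChannelBuilder.forTarget("localhost:50051") // placeholder target
        .usePlaintext()
        .defaultServiceConfig(serviceConfig)
        .enableRetry()
        .build();
  }
}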

ejona86 commented 1 year ago

I agree that disconnects are generally UNAVAILABLE. The problem in this case is "you shouldn't have seen that error." It could have been normal, or it could have been an internal error.

https://github.com/grpc/grpc-java/commit/9bdb8f005a392dc358b59e0f3b0e4b34bec222c0 is working, and we have some stacktrace information to go on. But it is rather strange that we learn that the connection is closed when trying to write headers.

Are you using plaintext (no TLS)?

cocreature commented 1 year ago

Yes, this is plaintext.

ejona86 commented 1 year ago

I think there's enough here to try and make educated guesses on ways that might improve it. But it will be troublesome without a reproduction. It is timing-based, so a reproduction will be probabilistic.

@cocreature, could you share a toxiproxy configuration that kills the connection? I'm not already familiar with it, but it seems great for trying to trigger this. Are you using reset_peer?

cocreature commented 1 year ago

Let me describe our toxiproxy config as best I can: we're using toxiproxy-server version 2.5.0 and the toxiproxy Java client.

We're creating the proxy within our tests through client.createProxy(name, listenAddress, upstreamAddress) where client is created through new ToxiproxyClient().

Our client (which is throwing the error) connects to the listen address of the proxy. The actual server is behind the upstream address.

To break the connection we use proxy.disable().

That corresponds to https://github.com/shopify/toxiproxy#down.

We don't use reset_peer.
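
For reference, a minimal sketch of the setup described above using the toxiproxy-java client; the proxy name and addresses are placeholders, and it assumes a toxiproxy-server running locally on its default port.

import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;

final class ProxyOutageSketch {
  public static void main(String[] args) throws Exception {
    // Talks to the local toxiproxy-server's HTTP API.
    ToxiproxyClient client = new ToxiproxyClient();

    // The gRPC client dials the listen address; the real server sits at the upstream address.
    Proxy proxy = client.createProxy("grpc-upstream", "localhost:18080", "localhost:8080");

    // ... run gRPC traffic through localhost:18080 ...

    // Simulate the outage: this is toxiproxy's "down" state, which closes existing
    // connections and stops accepting new ones.
    proxy.disable();

    // Restore connectivity; the gRPC channel should reconnect on its own.
    proxy.enable();
  }
}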

ejona86 commented 1 year ago

Looks like down just closes the connection, without setting linger to 0 (as used by reset).
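
To illustrate the distinction with plain Java sockets (a sketch, not toxiproxy code): a normal close performs an orderly FIN shutdown, while SO_LINGER with a zero timeout makes close() send a TCP RST, which the peer sees as a connection reset.

import java.io.IOException;
import java.net.Socket;

final class CloseModes {
  // Orderly shutdown: the peer reads EOF after a FIN. This matches what "down" does.
  static void closeGracefully(Socket socket) throws IOException {
    socket.close();
  }

  // Abortive shutdown: linger enabled with timeout 0 makes close() send a RST,
  // which the peer observes as "Connection reset by peer" (what "reset" does).
  static void closeWithReset(Socket socket) throws IOException {
    socket.setSoLinger(true, 0);
    socket.close();
  }
}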

marx-freedom commented 1 year ago

Getting the same exception. Unfortunately, we only have a trimmed stack trace in our logs.

io.grpc.StatusRuntimeException: UNKNOWN: channel closed
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:275)
Caused by:
java.nio.channels.ClosedChannelException
        at io.grpc.netty.Utils.statusFromThrowable(Utils.java:275)
Caused by:
io.netty.channel.StacklessClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
Caused by:
io.netty.channel.unix.Errors$NativeIoException: sendAddress(..) failed: Connection reset by peer

ejona86 commented 1 year ago

@marx-freedom, your issue seems unrelated to this. File a separate issue if you need to discuss it. In your case gRPC did know why the connection was closed: "Connection reset by peer."