http4s / blaze

Blazing fast NIO microframework and Http Parser
Apache License 2.0

WritePendingException: null from client #648

Open rossabaker opened 5 years ago

rossabaker commented 5 years ago

I can sometimes hit this when evalMapping a basic GET request 20,000 times.

2018-11-15 02:30:44 ERROR [scala-execution-context-global-346] [] [Http1Connection.scala:153] - Error rendering request
java.nio.channels.WritePendingException: null
    at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:352)
    at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:416)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.org$http4s$blaze$channel$nio2$ByteBufferHead$$go$1(ByteBufferHead.scala:42)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:58)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:25)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:15)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:96)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:94)
    at org.http4s.client.blaze.ReadBufferStage.channelWrite(ReadBufferStage.scala:13)
    at org.http4s.client.blaze.ReadBufferStage.writeRequest(ReadBufferStage.scala:20)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:96)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:94)
    at org.http4s.blazecore.ResponseHeaderTimeoutStage.channelWrite(ResponseHeaderTimeoutStage.scala:13)
    at org.http4s.blazecore.ResponseHeaderTimeoutStage.writeRequest(ResponseHeaderTimeoutStage.scala:39)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:96)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:94)
    at org.http4s.client.blaze.Http1Connection.channelWrite(Http1Connection.scala:24)
    at org.http4s.blazecore.util.CachingChunkWriter.doWriteEnd(CachingChunkWriter.scala:74)
    at org.http4s.blazecore.util.CachingChunkWriter.writeEnd(CachingChunkWriter.scala:55)
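For reference, the kind of load loop that triggers this can be sketched as follows. This is my reconstruction, not the original script: the endpoint, port, and builder wiring are assumptions.

```scala
import cats.effect.{ExitCode, IO, IOApp}
import fs2.Stream
import org.http4s.client.blaze.BlazeClientBuilder
import org.http4s.implicits._
import scala.concurrent.ExecutionContext.global

// Hedged sketch: evalMap a basic GET 20,000 times through one blaze client.
// http://localhost:8080/ping is a placeholder endpoint, not from the report.
object Repro extends IOApp {
  val requestCount = 20000

  def run(args: List[String]): IO[ExitCode] =
    BlazeClientBuilder[IO](global).resource
      .use { client =>
        Stream
          .range(0, requestCount)
          .covary[IO]
          .evalMap(_ => client.expect[String](uri"http://localhost:8080/ping"))
          .compile
          .drain
      }
      .map(_ => ExitCode.Success)
}
```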

/cc @ngbinh, who also reported it

ngbinh commented 5 years ago

In our case, it happens when we run our integration suites (probably hard to reproduce).

rossabaker commented 5 years ago

I have a pretty easy reproduction on this one. Not one I'm sure I want to run on Travis, but it shows the bug.

ngbinh commented 5 years ago

That's great

rossabaker commented 5 years ago

I have confirmed that this is not the timeout stages. I stripped them all away and it still happens. I now suspect that it is related to the simultaneous rendering of the request with receiving of the response. I think we need to figure out how to formally cancel the request writer and await its cancellation before recycling.
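The proposed sequencing, cancel the request writer and await that cancellation before recycling, can be sketched with cats-effect fibers. The names here (writeRequestBody, readResponse, recycleConnection) are hypothetical stand-ins for the Http1Connection internals, not the actual fix:

```scala
import cats.effect.{ContextShift, IO}

// Hedged sketch: the request body is rendered on its own fiber. Before the
// connection goes back to the pool, that fiber is cancelled and the
// cancellation is awaited, so no write can still be pending on the
// underlying AsynchronousSocketChannel.
def renderAndRecycle(
    writeRequestBody: IO[Unit],
    readResponse: IO[String],
    recycleConnection: IO[Unit]
)(implicit cs: ContextShift[IO]): IO[String] =
  for {
    writer   <- writeRequestBody.start // render the request concurrently
    response <- readResponse           // the response can complete first
    _        <- writer.cancel          // cancel the writer and wait for it...
    _        <- recycleConnection      // ...only then is the connection safe to reuse
  } yield response
```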

cquiroz commented 4 years ago

I observe this under a bit of high traffic over websockets, using 0.21.0-RC1. Was there ever a fix?

rossabaker commented 4 years ago

Not yet.

kczulko commented 4 years ago

Hi,

I am facing the same issue. Here is my stack trace (sorry I cannot do more on it right now :/ ):

http4s version: 0.21.0

20/04/16 06:10:50 ERROR blaze.Http1Connection: Error rendering request
java.nio.channels.WritePendingException
    at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:352)
    at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:416)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.org$http4s$blaze$channel$nio2$ByteBufferHead$$go$1(ByteBufferHead.scala:42)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:58)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:109)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:107)
    at org.http4s.client.blaze.ReadBufferStage.channelWrite(ReadBufferStage.scala:13)
    at org.http4s.client.blaze.ReadBufferStage.writeRequest(ReadBufferStage.scala:21)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:109)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:107)
    at org.http4s.blazecore.IdleTimeoutStage.channelWrite(IdleTimeoutStage.scala:12)
    at org.http4s.blazecore.IdleTimeoutStage.writeRequest(IdleTimeoutStage.scala:46)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:109)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:107)
    at org.http4s.client.blaze.Http1Connection.channelWrite(Http1Connection.scala:25)
    at org.http4s.blazecore.util.IdentityWriter.writeBodyChunk(IdentityWriter.scala:56)
    at org.http4s.blazecore.util.EntityBodyWriter.$anonfun$writePipe$3(EntityBodyWriter.scala:64)
    at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
    at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
    at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:86)
    at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
    at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
    at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
    at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
    at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
    at cats.effect.internals.Trampoline.execute(Trampoline.scala:43)
    at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
    at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:72)
    at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:52)
    at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:136)
    at cats.effect.internals.IORunLoop$.start(IORunLoop.scala:34)
    at cats.effect.internals.IOBracket$.$anonfun$apply$1(IOBracket.scala:44)
    at cats.effect.internals.IOBracket$.$anonfun$apply$1$adapted(IOBracket.scala:34)
    at cats.effect.internals.IORunLoop$RestartCallback.start(IORunLoop.scala:341)
    at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:119)
    at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
    at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
    at cats.effect.internals.PoolUtils$$anon$2$$anon$3.run(PoolUtils.scala:51)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
caoilte commented 4 years ago

Can anyone explain the consequence of this exception getting logged? Does it cancel one or more running requests? The exception doesn't seem to bubble up so does that mean that an internal retry mechanism is catching it?

rossabaker commented 4 years ago

A fiber is started to write the request, and this message is logged when that fiber fails.

It's an interesting question: my inclination is that we should instead be canceling the request and raising an error. But the response to a partial request rendering might be part of the recovery. It's natural that the server can begin responding to a partially written request, and bad things can always happen when writing a request, even after fixing this bug. How do we best represent "this is the response we managed to get from an incorrectly rendered request?" I think the status quo is wrong, but I'm not sure what's right.
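The "cancel the request and raise an error" inclination could take a shape like this. It's only a sketch of one possibility, not the eventual http4s behavior, and strictExchange is a name I made up:

```scala
import cats.effect.{ContextShift, IO}

// Hedged sketch: instead of logging the writer's failure on a detached
// fiber, join it after the response arrives so a write error surfaces to
// the caller, and cancel it if reading fails first.
def strictExchange(
    writeRequest: IO[Unit],
    readResponse: IO[String]
)(implicit cs: ContextShift[IO]): IO[String] =
  writeRequest.start.flatMap { writer =>
    readResponse
      .flatMap(res => writer.join.map(_ => res)) // re-raise any write failure
      .handleErrorWith(e => writer.cancel.flatMap(_ => IO.raiseError(e)))
  }
```

Note this still does not answer the harder question of what to do with a response the server produced for a partially written request; it only stops the error from disappearing into a log line.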

iRevive commented 4 years ago

Hello. I ran into the same issue. For some reason, I wasn't able to reproduce the error locally, while in the production setup it appears constantly. Perhaps networking or the platform (Linux) is part of the problem.

Prerequisites

http4s version: 0.21.4
pool size: 1
idle timeout: Duration.Inf

Successful execution flow

[2020-06-22 19:39:36,806] DEBUG [PoolManager]: Requesting connection: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2020-06-22 19:39:36,806] DEBUG [PoolManager]: Recycling connection: curAllocated=1 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2020-06-22 19:39:36,806] DEBUG [Http1Connection]: Connection was idle. Running.
[2020-06-22 19:39:36,806] DEBUG [Http1Connection]: Beginning request: POST http://dns:8080/api/v1/datapoints
[2020-06-22 19:39:36,809] DEBUG [Http1Connection]: Resetting org.http4s.client.blaze.Http1Connection after completing request.
[2020-06-22 19:39:36,809] DEBUG [PoolManager]: Recycling connection: curAllocated=1 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2020-06-22 19:39:36,809] DEBUG [PoolManager]: Returning idle connection to pool: curAllocated=1 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false

Failed execution flow

[2020-06-22 19:39:36,811] DEBUG [PoolManager]: Requesting connection: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2020-06-22 19:39:36,811] DEBUG [PoolManager]: Recycling connection: curAllocated=1 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2020-06-22 19:39:36,811] DEBUG [Http1Connection]: Connection was idle. Running.
[2020-06-22 19:39:36,811] DEBUG [Http1Connection]: Beginning request: POST http://dns:8080/api/v1/datapoints
[2020-06-22 19:39:36,814] DEBUG [Http1Connection]: Resetting org.http4s.client.blaze.Http1Connection after completing request.
[2020-06-22 19:39:36,814] ERROR [Http1Connection]: Error rendering request

Exception

java.nio.channels.WritePendingException: null
    at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:353)
    at java.base/sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:417)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.org$http4s$blaze$channel$nio2$ByteBufferHead$$go$1(ByteBufferHead.scala:42)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:58)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:25)
    at org.http4s.blaze.channel.nio2.ByteBufferHead.writeRequest(ByteBufferHead.scala:16)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:96)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:94)
    at org.http4s.client.blaze.ReadBufferStage.channelWrite(ReadBufferStage.scala:14)
    at org.http4s.client.blaze.ReadBufferStage.writeRequest(ReadBufferStage.scala:20)
    at org.http4s.blaze.pipeline.Tail.channelWrite(Stages.scala:96)
    at org.http4s.blaze.pipeline.Tail.channelWrite$(Stages.scala:94)
    at org.http4s.client.blaze.Http1Connection.channelWrite(Http1Connection.scala:25)
    at org.http4s.blazecore.util.IdentityWriter.writeBodyChunk(IdentityWriter.scala:57)
    at org.http4s.blazecore.util.IdentityWriter.writeEnd(IdentityWriter.scala:64)
    at org.http4s.blazecore.util.EntityBodyWriter.$anonfun$writeEntityBody$1(EntityBodyWriter.scala:53)

The most interesting part

[2020-06-22 19:39:36,814] DEBUG [Http1Connection]: Resetting org.http4s.client.blaze.Http1Connection after completing request.
[2020-06-22 19:39:36,814] ERROR [Http1Connection]: Error rendering request

At this point, I cannot understand the logic behind these messages. Resetting ... after completing request. is printed when the reading stage (response) completes, while the WritePendingException is thrown in the writing stage (request).

Can the writing stage be active along with the reading stage?
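The JDK-level invariant behind the exception is easy to demonstrate in isolation: an AsynchronousSocketChannel allows exactly one outstanding write, and a second write() issued before the first completes throws WritePendingException. This standalone demo is mine, not blaze code:

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{AsynchronousServerSocketChannel, AsynchronousSocketChannel, WritePendingException}
import java.util.concurrent.{TimeUnit, TimeoutException}

// Hedged, standalone demo: once the socket buffers are full, a write stays
// in flight, and a second write() on the same channel throws
// WritePendingException -- the same exception blaze logs when the request
// writer and connection reuse overlap.
object WriteRaceDemo {
  def demo(): Boolean = {
    val server = AsynchronousServerSocketChannel.open()
      .bind(new InetSocketAddress("127.0.0.1", 0))
    val client = AsynchronousSocketChannel.open()
    client.connect(server.getLocalAddress).get()
    // The server never accepts or reads, so writes eventually back up.

    var pendingFound = false
    var raced = false
    while (!pendingFound) {
      val f = client.write(ByteBuffer.allocate(1 << 20)) // 1 MiB per write
      try { f.get(200, TimeUnit.MILLISECONDS); () }
      catch {
        case _: TimeoutException =>
          pendingFound = true // this write is still in flight
          raced =
            try { client.write(ByteBuffer.wrap("x".getBytes)); false }
            catch { case _: WritePendingException => true }
      }
    }
    client.close()
    server.close()
    raced
  }
}
```

So yes, if anything issues a write while the body writer's write is still pending (or vice versa), this exception is exactly what the JDK produces.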

caoilte commented 4 years ago

@iRevive the problem went away completely once I rate-limited our use of the blaze HTTP client. We would see it when the node was doing > 250 requests/second; once I limited the service to 200 requests/second, we stopped seeing the problem entirely. (I used https://github.com/SystemFw/upperbound to do the rate limiting.)
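upperbound is one way to do it; fs2 also ships a built-in throttle. A sketch of capping a stream of calls at roughly 200 requests/second with Stream.metered, where the endpoint and surrounding wiring are illustrative, not from this thread:

```scala
import scala.concurrent.duration._
import cats.effect.{IO, Timer}
import fs2.Stream
import org.http4s.client.Client
import org.http4s.implicits._

// Hedged sketch: metered spaces elements one per 5 ms, so at most
// ~200 requests/second reach the blaze client.
def rateLimited(client: Client[IO], n: Int)(implicit timer: Timer[IO]): IO[Unit] =
  Stream
    .range(0, n)
    .covary[IO]
    .metered(5.millis) // 1 request / 5 ms = 200 req/s
    .evalMap(_ => client.expect[String](uri"http://localhost:8080/ping"))
    .compile
    .drain
```

Of course this only hides the race by keeping writes from overlapping; it is a workaround, not a fix.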

nigredo-tori commented 3 years ago

Not sure how useful all of this will be... In one service we have a periodic process that queries a few other services once an hour. In particular, it sends two consecutive GET queries (over plain HTTP) to a service that doesn't see much action otherwise. On one of the instances (with both of the services running on Windows) we're seeing something like this around 80% of the time:

  1. java.nio.channels.WritePendingException: null in the log.
  2. 30 seconds (the request timeout) pass.
  3. We see the results of the requests. The first one has succeeded, the second one has timed out.

I've failed to reproduce this locally or get any extra data (I don't control this instance), so I'm at a loss. I can't understand why the timeout happens (as opposed to a retry), and why this only happens between these two services - even though there are more places with pretty much the same flow.

UPD: I've since seen this issue with other services. Also, the timeout behavior might be due to http4s/blaze#642. As a workaround, we've just switched to another client (Jetty).

kovacshuni commented 3 years ago

Any news on this? I'm also seeing a relation between the frequency of this error and the rate at which I make requests: the higher the throughput, the more often it throws.