Open TimWSpence opened 1 year ago
This should be a unit test, but I think we've got a bigger problem than performance with that parallel dispatcher: https://github.com/http4s/http4s-servlet/commit/4c9eb647f5525c0f7c299446d19244866f708185. Specifically, I'm seeing it reading an empty body!
[Edit: no, sequential is failing that way, too. I think I need to look at this with fresh eyes tomorrow.]
This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.
Yeah it's a shame because the new code is much easier to understand. I wonder if part of the problem is that we're effectively double-queuing each chunk? Because there's a queue inside the sequential dispatcher as well.
The `unsafeRunAndForget` does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.
Yeah it seems like it shouldn't work to me. But there may be a critical correctness bug in my understanding 😅
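The ordering hazard is easy to see outside cats-effect: a parallel dispatcher hands each fire-and-forget task to whichever worker is free, so two submissions can complete out of order, while a sequential dispatcher runs them strictly one after another. Here is a minimal plain-JDK sketch of that distinction, using a single-threaded `ExecutorService` as a stand-in for a sequential `Dispatcher` (the names and setup are illustrative, not the http4s code):

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.jdk.CollectionConverters._

// Stand-in for a sequential Dispatcher: a single-threaded executor runs
// fire-and-forget tasks strictly in submission order.
val sequential = Executors.newSingleThreadExecutor()
val received = new ConcurrentLinkedQueue[Int]()

(1 to 100).foreach { chunk =>
  // analogous to q.submit(chunk).unsafeRunAndForget on a sequential Dispatcher
  sequential.execute(() => { received.add(chunk); () })
}
sequential.shutdown()
sequential.awaitTermination(5, TimeUnit.SECONDS)

// A single worker preserves submission order. A multi-threaded pool (the
// parallel-Dispatcher analogue) makes no such guarantee, so chunks could
// reach the queue out of order.
assert(received.asScala.toList == (1 to 100).toList)
```

With `Executors.newFixedThreadPool(4)` in place of the single-threaded executor, the final assertion can fail on any run, which is the shape of the concern about `unsafeRunAndForget` on a parallel dispatcher.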
Also thanks very much for fixing the spacing on the benchmarks! 😂
I went ahead and made it a per-request dispatcher. Quick-and-dirty as of 9210893:
```
[info] Benchmark (iters) (size) Mode Cnt Score Error Units
[info] ServletIoBenchmarks.reader 1000 100000 thrpt 3 828.621 ± 15.904 ops/s
[info] ServletIoBenchmarks.requestBody 1000 100000 thrpt 3 56.251 ± 10.994 ops/s
```
Here's where `requestBody` is spending its time:
```
[info] 3.1% 28.4% sun.misc.Unsafe.unpark
[info] 2.2% 20.3% java.lang.Object.hashCode
[info] 1.5% 13.1% sun.misc.Unsafe.park
[info] 0.9% 8.4% java.io.InputStream.read
[info] 0.8% 6.9% cats.effect.IOPlatform.<init>
[info] 0.8% 6.8% cats.effect.IOFiber.runLoop
[info] 0.2% 2.2% cats.effect.IOFiber.succeeded
[info] 0.2% 1.4% cats.effect.IO.<init>
[info] 0.1% 1.2% java.util.Random.nextBytes
[info] 0.1% 1.1% cats.effect.kernel.Resource.loop$1
[info] 1.1% 10.1% <other>
```
This is `reader`:
```
[info] 1.8% 19.8% org.http4s.servlet.NonBlockingServletIo.org$http4s$servlet$NonBlockingServletIo$$read$1
[info] 1.7% 19.4% java.util.Random.nextBytes
[info] 1.4% 15.5% cats.effect.IOFiber.runLoop
[info] 1.0% 10.6% cats.effect.IOPlatform.<init>
[info] 0.4% 4.3% cats.effect.IOFiber.succeeded
[info] 0.3% 2.8% cats.effect.IO$$anon$4.delay
[info] 0.2% 2.7% cats.effect.IO.flatMap
[info] 0.2% 2.0% cats.effect.kernel.Resource$$Lambda$19/1133484341.<init>
[info] 0.2% 1.9% fs2.Pull$.fs2$Pull$$viewL$1
[info] 0.1% 1.6% cats.effect.IO$.pure
[info] 1.8% 19.5% <other>
```
Note that the above stack profile is on Java 8. We get a different picture on Java 17. We should be targeting Java 8 with the library for maximal support, but we should be optimizing for Java 17.
`requestBody` stack:
```
[info] 7.7% 44.0% <stack is empty, everything is filtered?>
[info] 2.6% 14.7% cats.effect.IOFiber.runLoop
[info] 1.4% 8.2% jdk.internal.misc.Unsafe.unpark
[info] 0.9% 5.1% java.util.Random.nextBytes
[info] 0.7% 4.1% cats.effect.unsafe.ScalQueue.poll
[info] 0.7% 4.1% cats.effect.IOFiber.succeeded
[info] 0.3% 1.5% cats.effect.kernel.Resource.loop$1
[info] 0.2% 1.2% cats.effect.std.Dispatcher.unsafeRunAndForget
[info] 0.2% 1.2% cats.effect.IO.<init>
[info] 0.2% 1.0% cats.Monad.$anonfun$map$1
[info] 2.6% 14.8% <other>
```
and
```
[info] ServletIoBenchmarks.reader 1000 100000 thrpt 3 353.705 ± 580.082 ops/s
[info] ServletIoBenchmarks.requestBody 1000 100000 thrpt 3 77.763 ± 2.734 ops/s
```
Fascinating that `reader` got worse, and I don't know WTF that error margin is about.
I started a Pull-based implementation that avoids the queues and quickly ran into about a 3x performance loss. If I change `requestBody` to be equivalent to `reader`, so that the only difference is creating and releasing an (unused in this impl) sequential Dispatcher per request, we are already in a mess!
```
[info] ServletIoBenchmarks.reader 1000 100000 thrpt 2 470.326 ops/s
[info] ServletIoBenchmarks.requestBody 1000 100000 thrpt 2 153.507 ops/s
```
What I don't know is how much a decline in these benchmarks translates to a decline in real-world throughput. But I don't think we can compete with `reader` with a dispatcher-per-request.
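As a rough sketch of what "avoids the queues" means: instead of a read listener pushing chunks through a `Queue` via a `Dispatcher`, the consumer pulls each chunk directly from the source as it demands it. The real implementation would use fs2's `Pull` and `IO`; the `chunks` helper below is a hypothetical plain-Scala analogue using a blocking `Iterator`, just to show the control flow:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Hypothetical queue-free, pull-based reader: each chunk is read from the
// InputStream only when the consumer asks for the next element, so no
// Queue and no per-chunk fiber dispatch are involved.
def chunks(in: InputStream, chunkSize: Int): Iterator[Array[Byte]] =
  Iterator
    .continually {
      val buf = new Array[Byte](chunkSize)
      val n = in.read(buf)
      if (n == -1) None else Some(buf.take(n))
    }
    .takeWhile(_.isDefined)
    .flatten

val data = Array.tabulate[Byte](10)(_.toByte)
val out = chunks(new ByteArrayInputStream(data), 4).toArray
// 10 bytes in chunks of 4 arrive as chunk sizes 4, 4, 2
assert(out.map(_.length).toList == List(4, 4, 2))
assert(out.flatten.sameElements(data))
```

The trade-off discussed above is that the fs2 version of this shape still pays for a per-request sequential Dispatcher's setup and teardown, which is where the benchmark gap appears.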
@rossabaker sorry I let this fall off my radar. Is there anything I can do to help get this whole performance thing over the line? I wasn't sure if you were planning to carry on with this PR or revert to the previous implementation.
I got distracted by life and a couple other projects this week. Yeah... the current one is not only not performant, but strikingly incorrect. My inclination is that we should revert without breaking binary compatibility, and meanwhile keep exploring a way to clean it up and remain performant.
Alright, we can release #175 to undo the damage, and then keep exploring better ways via this PR.
Background
We merged some dependency updates and noticed a spike in CPU usage and in GC frequency/duration. We managed to narrow it down to `http4s-servlet:0.23.12` (`0.23.13` is also affected). With some git bisection and making local snapshots we managed to pin it down to https://github.com/http4s/http4s-servlet/pull/50.

Changes in this PR
This PR adds a benchmark to compare the performance of `requestBody` and `reader`, and makes some suggested improvements to `requestBody` off the back of that. The benchmarks do show a very significant reduction in throughput and increase in allocations when using `requestBody`. The most significant thing that I found was that the `requestBody` benchmark was allocating an enormous number of `int[]` (35-40% of allocations) because we are spawning 3 additional fibers per chunk - one in the parallel `Dispatcher` and two because the use of `Stream#concurrently` means that `Pull.eval(q.take)` is an interruptible eval. Consequently I made the following changes:

- Use a sequential `Dispatcher` (I was actually confused how the current code works, as there are no ordering guarantees for `q.submit(chunk).unsafeRunAndForget` when running on the parallel `Dispatcher`?) to eliminate one fiber allocation. I'm only forcing sequential in the benchmarks thus far, but this very much supports https://github.com/http4s/http4s-servlet/pull/170
- Replace `concurrently` with `flatMap` so that `Pull.eval(q.take)` is no longer interruptible, and hence eliminate the other fiber allocations
- Even though `requestBody` still gives ~50% less throughput and allocates about 10X more, I thought that every little helps.

Benchmark results
Current version
This PR
[Edited by @rossabaker to make the benchmarks monospace]