Benchmarks and improve performance of `ServletIo#requestBody` #171

Open · TimWSpence opened 1 year ago

TimWSpence commented 1 year ago

Background

We merged some dependency updates and noticed a spike in CPU usage and in GC frequency/duration. We narrowed it down to http4s-servlet 0.23.12 (0.23.13 is also affected). With some git bisection and local snapshot builds, we pinned it down to https://github.com/http4s/http4s-servlet/pull/50.

Changes in this PR

This PR adds a benchmark comparing the performance of requestBody and reader, and makes some suggested improvements to requestBody off the back of it. The benchmarks show a very significant reduction in throughput and increase in allocations when using requestBody. The most significant thing I found was that the requestBody benchmark was allocating an enormous number of int[] (35-40% of allocations) because we spawn 3 additional fibers per chunk: one in the parallel Dispatcher, and two because the use of Stream#concurrently means that Pull.eval(q.take) is an interruptible eval.

Consequently I made the following changes:

  1. Use a sequential Dispatcher to eliminate one fiber allocation (I was actually confused about how the current code even works, as there are no ordering guarantees for q.submit(chunk).unsafeRunAndForget when running on the parallel Dispatcher?). I'm only forcing sequential in the benchmarks thus far, but this very much supports https://github.com/http4s/http4s-servlet/pull/170
  2. Replace concurrently with flatMap so that Pull.eval(q.take) is no longer interruptible, eliminating the other fiber allocations (see the sketch after this list)
  3. Remove the boxing on the queue. I didn't see a huge difference from this, but given that the benchmarks suggest the improved requestBody still gives ~50% less throughput and allocates about 10x more, I thought every little helps.
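
For illustration, here is a minimal sketch of the shape of change 2, assuming an fs2 Queue of None-terminated chunks (the names and wiring are illustrative, not the actual ServletIo internals):

```scala
import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

// Before: reading the queue under `concurrently` makes every
// `Pull.eval(q.take)` an interruptible eval, spawning extra fibers per chunk.
def bodyViaConcurrently(
    q: Queue[IO, Option[Chunk[Byte]]],
    producer: Stream[IO, Nothing]
): Stream[IO, Byte] =
  Stream.fromQueueNoneTerminatedChunk(q).concurrently(producer)

// After: drain the queue with a plain repeated eval, so each `q.take` is an
// ordinary, uninterruptible step and no companion fibers are allocated.
// (The producer side is driven by the servlet ReadListener offering to `q`
// through the Dispatcher, so no concurrent stream is needed here.)
def bodyViaRepeatEval(q: Queue[IO, Option[Chunk[Byte]]]): Stream[IO, Byte] =
  Stream.repeatEval(q.take).unNoneTerminate.unchunks
```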

Benchmark results

Current version

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score        Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       206.176 ±      2.882   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       514.322 ±      7.181  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2748150.063 ±      5.020    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       511.571 ±      8.988  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2733572.797 ±  37047.601    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.032 ±      0.010  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       172.966 ±     52.647    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       668.000               counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       361.000                   ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20        39.742 ±      0.779   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      2371.828 ±     46.455  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  65741530.766 ±   2074.704    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      2396.249 ±     47.821  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  66418490.614 ± 273169.904    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.029 ±      0.012  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       806.306 ±    327.324    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1886.000               counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       926.000                   ms

This PR

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score       Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       188.385 ±     3.473   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       470.069 ±     8.604  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2749302.659 ±  1382.197    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       468.061 ±    10.620  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2737395.458 ± 25786.814    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.026 ±     0.009  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       150.846 ±    54.054    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       611.000              counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       357.000                  ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20       104.962 ±     0.326   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      1942.809 ±     6.386  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  20396363.510 ±  8978.371    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      1960.285 ±    11.354  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  20579794.985 ± 91489.964    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.022 ±     0.007  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       234.842 ±    70.506    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1545.000              counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       739.000                  ms

[Edited by @rossabaker to make the benchmarks monospace]

rossabaker commented 1 year ago

This should be a unit test, but I think we've got a bigger problem than performance with that parallel dispatcher: https://github.com/http4s/http4s-servlet/commit/4c9eb647f5525c0f7c299446d19244866f708185. Specifically, I'm seeing it reading an empty body!

[Edit: no, sequential is failing that way, too. I think I need to look at this with fresh eyes tomorrow.]
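
For context, the ordering concern being discussed looks roughly like this (an illustrative sketch, not the actual ServletIo code):

```scala
import cats.effect.IO
import cats.effect.std.{Dispatcher, Queue}
import fs2.Chunk

// On a *parallel* Dispatcher, each submitted action runs on its own fiber,
// so two chunks offered in quick succession can land in `q` out of order,
// even though the container invokes the ReadListener callbacks serially.
// A sequential Dispatcher runs submissions in submission order.
def onDataAvailable(
    dispatcher: Dispatcher[IO],
    q: Queue[IO, Option[Chunk[Byte]]],
    chunk: Chunk[Byte]
): Unit =
  dispatcher.unsafeRunAndForget(q.offer(Some(chunk)))
```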

TimWSpence commented 1 year ago

> This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.

Yeah it's a shame because the new code is much easier to understand. I wonder if part of the problem is that we're effectively double-queuing each chunk? Because there's a queue inside the sequential dispatcher as well.

> The unsafeRunAndForget does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.

Yeah, it seems like it shouldn't work to me. But there may be a critical correctness bug in my understanding 😅

Also thanks very much for fixing the spacing on the benchmarks! 😂

rossabaker commented 1 year ago

I went ahead and made it a per-request dispatcher. Quick-and-dirty as of 9210893:

[info] Benchmark                        (iters)  (size)   Mode  Cnt    Score    Error  Units
[info] ServletIoBenchmarks.reader          1000  100000  thrpt    3  828.621 ± 15.904  ops/s
[info] ServletIoBenchmarks.requestBody     1000  100000  thrpt    3   56.251 ± 10.994  ops/s
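
The per-request wiring presumably has roughly this shape (a hedged sketch; Dispatcher.sequential is the cats-effect 3.4+ constructor, and readChunks is a stand-in for the actual body logic):

```scala
import cats.effect.Async
import cats.effect.std.Dispatcher
import fs2.Stream

// Sketch of a dispatcher-per-request body: a sequential Dispatcher is
// acquired as a Resource whose lifetime is exactly one request's body stream.
def requestBody[F[_]: Async](
    readChunks: Dispatcher[F] => Stream[F, Byte]
): Stream[F, Byte] =
  Stream.resource(Dispatcher.sequential[F]).flatMap(readChunks)
```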

Here's where requestBody is spending its time:

[info]   3.1%  28.4% sun.misc.Unsafe.unpark
[info]   2.2%  20.3% java.lang.Object.hashCode
[info]   1.5%  13.1% sun.misc.Unsafe.park
[info]   0.9%   8.4% java.io.InputStream.read
[info]   0.8%   6.9% cats.effect.IOPlatform.<init>
[info]   0.8%   6.8% cats.effect.IOFiber.runLoop
[info]   0.2%   2.2% cats.effect.IOFiber.succeeded
[info]   0.2%   1.4% cats.effect.IO.<init>
[info]   0.1%   1.2% java.util.Random.nextBytes
[info]   0.1%   1.1% cats.effect.kernel.Resource.loop$1
[info]   1.1%  10.1% <other>

This is reader:

[info]   1.8%  19.8% org.http4s.servlet.NonBlockingServletIo.org$http4s$servlet$NonBlockingServletIo$$read$1
[info]   1.7%  19.4% java.util.Random.nextBytes
[info]   1.4%  15.5% cats.effect.IOFiber.runLoop
[info]   1.0%  10.6% cats.effect.IOPlatform.<init>
[info]   0.4%   4.3% cats.effect.IOFiber.succeeded
[info]   0.3%   2.8% cats.effect.IO$$anon$4.delay
[info]   0.2%   2.7% cats.effect.IO.flatMap
[info]   0.2%   2.0% cats.effect.kernel.Resource$$Lambda$19/1133484341.<init>
[info]   0.2%   1.9% fs2.Pull$.fs2$Pull$$viewL$1
[info]   0.1%   1.6% cats.effect.IO$.pure
[info]   1.8%  19.5% <other>

rossabaker commented 1 year ago

Note that the above stack profile is on Java 8. We get a different picture on Java 17. We should be targeting 8 with the library for maximal support, but optimizing for 17.
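
(For reproduction: these are JMH runs with the stack and gc profilers; with sbt-jmh that is roughly `Jmh/run -prof stack -prof gc ServletIoBenchmarks` from the benchmark module, though the exact module name and invocation here are a guess.)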

requestBody stack:

[info]   7.7%  44.0% <stack is empty, everything is filtered?>
[info]   2.6%  14.7% cats.effect.IOFiber.runLoop
[info]   1.4%   8.2% jdk.internal.misc.Unsafe.unpark
[info]   0.9%   5.1% java.util.Random.nextBytes
[info]   0.7%   4.1% cats.effect.unsafe.ScalQueue.poll
[info]   0.7%   4.1% cats.effect.IOFiber.succeeded
[info]   0.3%   1.5% cats.effect.kernel.Resource.loop$1
[info]   0.2%   1.2% cats.effect.std.Dispatcher.unsafeRunAndForget
[info]   0.2%   1.2% cats.effect.IO.<init>
[info]   0.2%   1.0% cats.Monad.$anonfun$map$1
[info]   2.6%  14.8% <other>

and

[info] ServletIoBenchmarks.reader              1000  100000  thrpt    3  353.705 ± 580.082  ops/s
[info] ServletIoBenchmarks.requestBody         1000  100000  thrpt    3   77.763 ±   2.734  ops/s

Fascinating that reader got worse, and I don't know WTF that error margin is about.

rossabaker commented 1 year ago

I started a Pull-based implementation that avoids the queues, and quickly ran into about a 3x performance loss. If I change requestBody to be equivalent to reader, so that the only difference is creating and releasing an (unused in this impl) sequential Dispatcher per request, we are already in a mess!

[info] ServletIoBenchmarks.reader                 1000  100000  thrpt    2  470.326          ops/s
[info] ServletIoBenchmarks.requestBody            1000  100000  thrpt    2  153.507          ops/s

What I don't know is how much a decline in these benchmarks translates to a decline in real-world throughput. But I don't think we can compete with reader with a dispatcher-per-request.
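
For the record, a queue-free Pull-based body has roughly this shape (a deliberately simplified, blocking sketch; the real non-blocking version must be driven by the ReadListener's onDataAvailable/isReady callbacks):

```scala
import cats.effect.Sync
import fs2.{Chunk, Pull, Stream}
import java.io.InputStream

// Simplified, blocking sketch of a queue-free body: read straight from the
// input stream and emit each buffer as a chunk until EOF. A fresh buffer per
// read keeps the emitted chunks from aliasing shared mutable state.
def pullBody[F[_]](in: InputStream, chunkSize: Int)(implicit F: Sync[F]): Stream[F, Byte] = {
  def go: Pull[F, Byte, Unit] = {
    val buf = new Array[Byte](chunkSize)
    Pull.eval(F.blocking(in.read(buf))).flatMap {
      case -1 => Pull.done
      case n  => Pull.output(Chunk.array(buf, 0, n)) >> go
    }
  }
  go.stream
}
```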

TimWSpence commented 1 year ago

@rossabaker sorry, I let this fall off my radar. Is there anything I can do to help get this whole performance thing over the line? I wasn't sure if you were planning to carry on with this PR or revert to the previous implementation.

rossabaker commented 1 year ago

I got distracted by life and a couple other projects this week. Yeah... the current one is not only not performant, but strikingly incorrect. My inclination is that we should revert without breaking binary compatibility, and meanwhile keep exploring a way to clean it up and remain performant.

rossabaker commented 1 year ago

Alright, we can release #175 to undo the damage, and then keep exploring better ways via this PR.