armanbilge opened 1 year ago
CI is green in both http4s and Skunk.
I wrote the laziest server 😅
```scala
//> using dep co.fs2::fs2-io::3.8-1af22dd
//> using dep org.http4s::http4s-ember-server::0.23.20

import cats.effect.*
import org.http4s.ember.server.EmberServerBuilder

object App extends IOApp.Simple:
  def run = EmberServerBuilder.default[IO].build.useForever
```
Then I benchmarked it with `hey` (after some warmup). Looks like a 20% improvement!
Before:
```
Summary:
  Total:        30.0515 secs
  Slowest:      0.0634 secs
  Fastest:      0.0001 secs
  Average:      0.0028 secs
  Requests/sec: 17845.4902
```
After:
```
Summary:
  Total:        30.0243 secs
  Slowest:      0.1221 secs
  Fastest:      0.0001 secs
  Average:      0.0023 secs
  Requests/sec: 21782.6589
```
Thanks to @ChristopherDavenport for doing some additional benchmarking! tl;dr Ember has now surpassed Blaze 🔥
https://gist.github.com/ChristopherDavenport/2e5ad15cd293aa0816090f8677b8cc3b
w00t! 🔥
> Ember has now surpassed Blaze 🔥
On throughput but not latency, right? Also, has something changed with Ember's latency? cc @ChristopherDavenport Or maybe I'm just misreading the results...
Ember Before:
```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.19ms   17.05ms  487.71ms   98.89%
    Req/Sec     5.05k     2.51k   11.17k    61.68%
```
Ember After:
```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.98ms   55.97ms    1.30s    98.65%
    Req/Sec     7.06k     3.41k   14.89k    66.11%
```
Nothing that I'm aware of on the server, which is what this tested. These were just a couple of one-off tests on my local machine. It's very possible I screwed something up by actually using the machine for something during the period (although I tried not to).
To get more concrete answers we'd need to actually spin something up and test properly with something like Gatling, and in terms of latency I'd need to switch from wrk to wrk2. I'll see if I can run the latter later.
I've been spending quite a bit of time on this today, and I'm not seeing anything like a 20% improvement. It's possible that results are much better at the low levels of concurrency that `hey` seems to use; that program does not seem to be very good for testing these things, to be honest, and I'd suggest using `wrk` instead and trying different numbers of connections to get a better idea of the performance changes. Maybe performance has also improved more when using fewer threads? I'm not really sure. My results are from a Ryzen 5900X (12 cores, 24 threads) on a hot summer day.
TFB results are quite similar, but it looks like there's a small improvement when enabling polling. There's some variance, perhaps more so than usual because it's a very warm day here in Oslo at a bit over 30 degrees Celsius, so I ran it several times.

- 1.0-416dd46-SNAPSHOT baseline: https://www.techempower.com/benchmarks/#section=test&shareid=882fc784-4260-45e9-a6ae-e875b407e191
- 1.0-416dd46-SNAPSHOT baseline, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=e18dfeb9-b66c-4835-805d-8ff23724be29
- 1.0-416dd46-SNAPSHOT polling: https://www.techempower.com/benchmarks/#section=test&shareid=468d67ed-e10b-41e7-ae13-93c0534f9c18
- 1.0-416dd46-SNAPSHOT polling, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=59b9fc19-d633-4ccf-85a1-16589eb1c42a
I was wondering if maybe there was some important fix in 0.23 that hadn't been merged into main yet, so I also ran it with 0.23.20, but the overall picture is similar.

- 0.23.20 baseline: https://www.techempower.com/benchmarks/#section=test&shareid=d39c3e94-5a3c-472a-9a63-168e350b85e9
- 0.23.20 baseline, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=7bf77b3c-ae15-4702-966a-405213e6fe8f
- 0.23.20 polling: https://www.techempower.com/benchmarks/#section=test&shareid=e8406980-2a06-4cce-824d-398e73a15e92
- 0.23.20 polling, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=fca65726-83cc-4df2-95d5-ebbb1c1b6666
I then ran the Ember benchmark ("simplest") by @ChristopherDavenport, using wrk with 300 connections but otherwise the same settings as TFB uses. I first ran it with Eclipse Temurin JDK 17 (just starting RealApp from within IntelliJ, with `-Xmx4g -Xms2g` as VM options), but didn't see much difference between polling and not, so I then switched to Azul Zulu Community JDK 20 and explored further. There seems to be quite a bit of variance sometimes when running these tests (probably GC, but I'm not sure), so I ran them 10 times:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 111094.77
Requests/sec: 116257.67
Requests/sec: 118451.51
Requests/sec: 120371.12
Requests/sec: 120780.44
Requests/sec: 117579.88
Requests/sec: 116879.41
Requests/sec: 119444.93
Requests/sec: 113217.95
Requests/sec: 116717.88
```
simplest ember polling, Azul 20, -Xmx4g -Xms2g:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 119709.67
Requests/sec: 120327.51
Requests/sec: 116398.66
Requests/sec: 119436.00
Requests/sec: 120864.85
Requests/sec: 116291.97
Requests/sec: 118581.67
Requests/sec: 116910.25
Requests/sec: 116744.99
Requests/sec: 120922.49
```
They seem to be close enough that I'd be hesitant to call it anything other than a draw.
Enabling pipelining (as TFB does) improves the numbers and seems to make them more consistent, but there's still not much of a difference when using polling:

simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 24`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188395.82
Requests/sec: 187820.24
Requests/sec: 187938.91
Requests/sec: 187871.00
Requests/sec: 187557.91
Requests/sec: 187908.12
Requests/sec: 186906.13
Requests/sec: 187941.53
Requests/sec: 187964.85
Requests/sec: 188687.31
```

simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 24`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188291.70
Requests/sec: 188193.24
Requests/sec: 188321.19
Requests/sec: 188406.69
Requests/sec: 188796.68
Requests/sec: 187570.31
Requests/sec: 188174.94
Requests/sec: 181370.82
Requests/sec: 186937.12
Requests/sec: 188360.46
```
Results get better, but variance increases, when running with `-- 16` as the argument instead (which I guess has to do with the number of threads used when making requests, but I haven't checked), and the variance is lower when using polling, though that may have been a coincidence:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 16`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 264116.61
Requests/sec: 174607.66
Requests/sec: 131080.75
Requests/sec: 262927.01
Requests/sec: 131162.33
Requests/sec: 131334.81
Requests/sec: 252555.60
Requests/sec: 231018.25
Requests/sec: 202961.52
Requests/sec: 190722.09
```
simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 16`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 252128.79
Requests/sec: 243816.50
Requests/sec: 237298.43
Requests/sec: 240201.98
Requests/sec: 240870.38
Requests/sec: 205297.17
Requests/sec: 243058.46
Requests/sec: 239905.62
Requests/sec: 179937.06
Requests/sec: 232942.17
```
I think the results are promising: nothing seems to have gotten worse, and it quite likely is better, but I'd love to see more benchmark results. 20% may be possible, but only in some circumstances, it seems?
Thank you for putting all the work into that! That's some really good information!
For folks following along here: we've also published a snapshot of @antoniojimeneznieto's work implementing a JVM polling system based on io_uring. The initial prototype piggy-backs on Netty's internal io_uring APIs.
Please give it a try! The linked PR demonstrates how to create a JVM Ember server using the fs2-io_uring snapshot.
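For reference, here's a minimal sketch of what wiring Ember up to the io_uring socket group might look like. The dependency coordinates and the `fs2.io.uring.net` package path below are my assumptions (check the linked PR for the actual ones); `withSocketGroup` is the standard EmberServerBuilder hook.

```scala
//> using dep org.http4s::http4s-ember-server::0.23.20
// hypothetical snapshot coordinates; see the linked PR for the real ones
//> using dep io.github.armanbilge::fs2-io_uring::0.1-SNAPSHOT

import cats.effect.*
import fs2.io.uring.net.UringSocketGroup // assumed package path
import org.http4s.ember.server.EmberServerBuilder

object UringApp extends IOApp.Simple:
  def run = EmberServerBuilder
    .default[IO]
    .withSocketGroup(UringSocketGroup[IO]) // route Ember's I/O through io_uring
    .build
    .useForever
```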
I ran the TFB benchmarks with my branch using your latest and greatest snapshots last weekend, and it wasn't worth writing about: slightly worse results than before. Well, it turns out I hadn't noticed the `withSocketGroup(UringSocketGroup[IO])` in your example, and it makes quite the difference...
- Baseline: https://www.techempower.com/benchmarks/#section=test&shareid=9b70928b-24e8-4a39-a5dc-7832d8b02cd6&test=plaintext
- fs2_iouring, also with `withSocketGroup(UringSocketGroup[IO])`: https://www.techempower.com/benchmarks/#section=test&shareid=2a700d9b-1d3d-4835-b366-12dbeb063575&test=plaintext
On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.
Fantastic. Amazing. Well done.
> On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.
What the holy crap.
I expected io_uring to make things faster, but I didn't expect it to be that much faster. There's a long way to go to productionize this, and I'm sure there's loads of stuff that'll get faster and slower along the way, but WOW. The fact that this is happening while we're still going through Netty's indirection is pretty impressive.
@wjoel I had an interesting thought: these results almost certainly say something about our syscall overhead. If we're reaping this magnitude of performance improvement just from swapping out the polling system, then it's basically saying that Ember is almost entirely bounded by the overhead of NIO. We won't know for sure until we get some more polling systems into the mix (this is a three-variable algebra problem and we only have two equations so far), but this is really fascinating.
I've published a snapshot of FS2 I/O that integrates with the new polling system (based on CE 3.6-e9aeb8c). You can drop this dependency into any application using the latest stable versions of http4s Ember, Skunk, etc. Please try it out and report back!! We would really appreciate it 😊
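As a sketch, trying the snapshot from a Scala CLI script would look something like the earlier server, with the fs2-io version overridden. The version string below is a placeholder, not the actual published coordinate:

```scala
// placeholder version for the announced fs2-io snapshot
//> using dep co.fs2::fs2-io::3.8-SNAPSHOT
// stable http4s; Ember picks up the snapshot fs2-io transitively
//> using dep org.http4s::http4s-ember-server::0.23.20

import cats.effect.*
import org.http4s.ember.server.EmberServerBuilder

// same minimal server as above, now running on the new polling system
object App extends IOApp.Simple:
  def run = EmberServerBuilder.default[IO].build.useForever
```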
Follow-up to: