armanbilge opened 1 year ago
CI is green in both http4s and Skunk.
I wrote the laziest server 😅
```scala
//> using dep co.fs2::fs2-io::3.8-1af22dd
//> using dep org.http4s::http4s-ember-server::0.23.20

import cats.effect.*
import org.http4s.ember.server.EmberServerBuilder

object App extends IOApp.Simple:
  def run = EmberServerBuilder.default[IO].build.useForever
```
Then I benchmarked it with `hey` (after some warmup). Looks like a 20% improvement!
Before:
```
Summary:
  Total:        30.0515 secs
  Slowest:      0.0634 secs
  Fastest:      0.0001 secs
  Average:      0.0028 secs
  Requests/sec: 17845.4902
```
After:
```
Summary:
  Total:        30.0243 secs
  Slowest:      0.1221 secs
  Fastest:      0.0001 secs
  Average:      0.0023 secs
  Requests/sec: 21782.6589
```
Thanks to @ChristopherDavenport for doing some additional benchmarking! tl;dr Ember has now surpassed Blaze 🔥
https://gist.github.com/ChristopherDavenport/2e5ad15cd293aa0816090f8677b8cc3b
w00t! 🔥
> Ember has now surpassed Blaze 🔥
On throughput but not latency, right? Also, has something changed with Ember's latency? cc @ChristopherDavenport Or maybe I'm just misreading the results...
Ember Before:
```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.19ms   17.05ms  487.71ms   98.89%
    Req/Sec     5.05k     2.51k   11.17k    61.68%
```
Ember After:
```
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.98ms   55.97ms    1.30s    98.65%
    Req/Sec     7.06k     3.41k   14.89k    66.11%
```
Nothing that I'm aware of on the server, which is what this tested. These were just a couple of one-off tests on my local machine. It's very possible I screwed something up by actually using the machine for something during the period (although I tried not to).
To get more concrete answers we'd need to actually spin something up and test properly with something like Gatling, and in terms of latency I'd need to switch from wrk to wrk2. I'll see if I can run the latter later.
I've been spending quite a bit of time on this today, and I'm not seeing anything like a 20% improvement. It's possible that results are much better at the low levels of concurrency that `hey` seems to use; that program does not seem to be very good for testing these things, to be honest, and I'd suggest using `wrk` instead and trying different numbers of connections to get a better idea of the performance changes. Maybe performance has also improved more when using fewer threads? I'm not really sure. My results are from a Ryzen 5900X (12 cores, 24 threads) on a hot summer day.
TFB results are quite similar, but it looks like there's a small improvement when enabling polling. There's some variance, perhaps more so than usual because it's a very warm day here in Oslo at a bit over 30 degrees Celsius, so I ran it several times.

- 1.0-416dd46-SNAPSHOT baseline: https://www.techempower.com/benchmarks/#section=test&shareid=882fc784-4260-45e9-a6ae-e875b407e191
- 1.0-416dd46-SNAPSHOT baseline, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=e18dfeb9-b66c-4835-805d-8ff23724be29
- 1.0-416dd46-SNAPSHOT polling: https://www.techempower.com/benchmarks/#section=test&shareid=468d67ed-e10b-41e7-ae13-93c0534f9c18
- 1.0-416dd46-SNAPSHOT polling, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=59b9fc19-d633-4ccf-85a1-16589eb1c42a
I was wondering if maybe there was some important fix in 0.23 that hadn't been merged into main yet, so I also ran it with 0.23.20, but the overall picture is similar.

- 0.23.20 baseline: https://www.techempower.com/benchmarks/#section=test&shareid=d39c3e94-5a3c-472a-9a63-168e350b85e9
- 0.23.20 baseline, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=7bf77b3c-ae15-4702-966a-405213e6fe8f
- 0.23.20 polling: https://www.techempower.com/benchmarks/#section=test&shareid=e8406980-2a06-4cce-824d-398e73a15e92
- 0.23.20 polling, take 2: https://www.techempower.com/benchmarks/#section=test&shareid=fca65726-83cc-4df2-95d5-ebbb1c1b6666
I then ran the Ember benchmark ("simplest") by @ChristopherDavenport, using wrk with 300 connections but otherwise the same settings as TFB uses. I first ran it with Eclipse Temurin JDK 17 (just starting RealApp from within IntelliJ, with `-Xmx4g -Xms2g` as VM options), but didn't see much difference between polling and not, so I then switched to Azul Zulu Community JDK 20 and explored further. There seems to be quite a bit of variance sometimes when running these tests (probably GC, but I'm not sure), so I ran them 10 times:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 111094.77
Requests/sec: 116257.67
Requests/sec: 118451.51
Requests/sec: 120371.12
Requests/sec: 120780.44
Requests/sec: 117579.88
Requests/sec: 116879.41
Requests/sec: 119444.93
Requests/sec: 113217.95
Requests/sec: 116717.88
```
simplest ember polling, Azul 20, -Xmx4g -Xms2g:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ | grep Requests/sec ; done
Requests/sec: 119709.67
Requests/sec: 120327.51
Requests/sec: 116398.66
Requests/sec: 119436.00
Requests/sec: 120864.85
Requests/sec: 116291.97
Requests/sec: 118581.67
Requests/sec: 116910.25
Requests/sec: 116744.99
Requests/sec: 120922.49
```
They seem to be close enough that I'd be hesitant to call it anything other than a draw.
Enabling pipelining (as TFB does) improves the numbers and seems to make them more consistent, but there's still not much of a difference when using polling:

simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 24`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188395.82
Requests/sec: 187820.24
Requests/sec: 187938.91
Requests/sec: 187871.00
Requests/sec: 187557.91
Requests/sec: 187908.12
Requests/sec: 186906.13
Requests/sec: 187941.53
Requests/sec: 187964.85
Requests/sec: 188687.31
```

simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 24`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 24 | grep Requests/sec ; done
Requests/sec: 188291.70
Requests/sec: 188193.24
Requests/sec: 188321.19
Requests/sec: 188406.69
Requests/sec: 188796.68
Requests/sec: 187570.31
Requests/sec: 188174.94
Requests/sec: 181370.82
Requests/sec: 186937.12
Requests/sec: 188360.46
```
Results get better, but variance increases, when running with `-- 16` as the argument instead (which I guess has to do with the number of threads used when making requests, but I haven't checked), and the variance is lower when using polling, though that may have been a coincidence:
simplest ember baseline, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 16`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 264116.61
Requests/sec: 174607.66
Requests/sec: 131080.75
Requests/sec: 262927.01
Requests/sec: 131162.33
Requests/sec: 131334.81
Requests/sec: 252555.60
Requests/sec: 231018.25
Requests/sec: 202961.52
Requests/sec: 190722.09
```
simplest ember polling, Azul 20, -Xmx4g -Xms2g, with pipelining and `-- 16`:
```
wjoel@apollo:~/dev/ember-polling$ for i in 1 2 3 4 5 6 7 8 9 0; do docker run --network=host techempower/tfb.wrk wrk --latency -d 15 -c 400 --timeout 8 -t 24 http://localhost:8080/ -s pipeline.lua -- 16 | grep Requests/sec ; done
Requests/sec: 252128.79
Requests/sec: 243816.50
Requests/sec: 237298.43
Requests/sec: 240201.98
Requests/sec: 240870.38
Requests/sec: 205297.17
Requests/sec: 243058.46
Requests/sec: 239905.62
Requests/sec: 179937.06
Requests/sec: 232942.17
```
I think the results are promising: nothing seems to have gotten worse, and it quite likely is better, but I'd love to see more benchmark results. 20% may be possible, but only in some circumstances, it seems?
Thank you for putting all the work into that! That's some really good information!
For folks following along here: we've also published a snapshot of @antoniojimeneznieto's work implementing a JVM polling system based on io_uring. The initial prototype piggy-backs on Netty's internal io_uring APIs.
Please give it a try! The linked PR demonstrates how to create a JVM Ember server using the fs2-io_uring snapshot.
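For reference, here's a minimal sketch of what wiring Ember up to the io_uring socket group might look like. The dependency coordinates and the `fs2.io.uring.net` package path below are my assumptions (check the linked PR for the actual ones); `withSocketGroup` is the standard EmberServerBuilder hook.

```scala
//> using dep org.http4s::http4s-ember-server::0.23.20
// hypothetical snapshot coordinates; see the linked PR for the real ones
//> using dep io.github.armanbilge::fs2-io_uring::0.1-SNAPSHOT

import cats.effect.*
import fs2.io.uring.net.UringSocketGroup // assumed package path
import org.http4s.ember.server.EmberServerBuilder

object UringApp extends IOApp.Simple:
  def run = EmberServerBuilder
    .default[IO]
    .withSocketGroup(UringSocketGroup[IO]) // route Ember's I/O through io_uring
    .build
    .useForever
```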
I ran the TFB benchmarks with my branch using your latest and greatest snapshots last weekend, and it wasn't worth writing about: slightly worse results than before. Well, it turns out I hadn't noticed the `withSocketGroup(UringSocketGroup[IO])` in your example, and it makes quite the difference...
- Baseline: https://www.techempower.com/benchmarks/#section=test&shareid=9b70928b-24e8-4a39-a5dc-7832d8b02cd6&test=plaintext
- fs2_iouring, also with `withSocketGroup(UringSocketGroup[IO])`: https://www.techempower.com/benchmarks/#section=test&shareid=2a700d9b-1d3d-4835-b366-12dbeb063575&test=plaintext
On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.
Fantastic. Amazing. Well done.
> On "meaningless" benchmarks you've made Ember approximately 340% faster, and I'd bet real money that it translates into real-world improvements as well. Whereas before, my CPU cores (or vCPU cores) were at most 30% loaded when running those benchmarks, they're now blazing along at 100%, as they should.
What the holy crap.
I expected io_uring to make things faster, but I didn't expect it to be that much faster. There's a long way to go to productionize this, and I'm sure there's loads of stuff that'll get faster and slower along the way, but WOW. The fact that this is happening while we're still going through Netty's indirection is pretty impressive.
@wjoel I had an interesting thought: these results almost certainly say something about our syscall overhead. If we're reaping this magnitude of performance improvement just from swapping out the polling system, then it's basically saying that Ember is almost entirely bounded by the overhead of NIO. We won't know for sure until we get some more polling systems into the mix (this is a three-variable algebra problem and we only have two equations so far), but this is really fascinating.
I've published a snapshot of FS2 I/O that integrates with the new polling system (based on CE 3.6-e9aeb8c). You can drop this dependency into any application using the latest stable versions of http4s Ember, Skunk, etc. Please try it out and report back!! We would really appreciate it 😊
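As a sketch, trying the snapshot from a Scala CLI script would look something like the earlier server, with the fs2-io version overridden. The version string below is a placeholder, not the actual published coordinate:

```scala
// placeholder version for the announced fs2-io snapshot
//> using dep co.fs2::fs2-io::3.8-SNAPSHOT
// stable http4s; Ember picks up the snapshot fs2-io transitively
//> using dep org.http4s::http4s-ember-server::0.23.20

import cats.effect.*
import org.http4s.ember.server.EmberServerBuilder

// same minimal server as above, now running on the new polling system
object App extends IOApp.Simple:
  def run = EmberServerBuilder.default[IO].build.useForever
```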
Follow-up to: