Closed TheYkk closed 6 months ago
I don't really understand what hardware or OS the benchmark results are from. On this machine (6C/12T Zen2 @ 2.1GHz) I get ~570 Mbps for Rust (both `--release` and unoptimized!), Go, and Atacama.
It even fails to scale when unlocking the CPU frequency, leading me to believe this is benchmarking the Linux TCP stack and/or context-switching for me (and I guess you'd see throughput proportional to core count).
Out of curiosity I did try `dune exec bench/echo_eio.exe`, and that did only 16 Mbps (28 Mbps with unlocked CPU frequency), so maybe the benchmark was done on a machine with more cores, 8C/16T maybe?
But then the fact that none of the other results are better than my ~570 Mbps upper bound confuses me greatly. Is this an already loaded system? I did this on a desktop machine and got shockingly repeatable numbers (as long as I wasn't actively using a browser or any other apps).
Was the benchmarking done on Windows or macOS? I'm not super familiar with their TCP stacks but I wouldn't be surprised if they were consistently (or, worse, inconsistently) slower than Linux (and that's before you add in scheduling, async APIs, etc.).
EDIT: oh, based on @leostera's Twitch stream, they dev on macOS, so unless the benchmark was done by SSH-ing into a Linux server, it must've been run on macOS. I still haven't gotten around to setting up a macOS machine, because the only time I need one is for baffling bug reports unique to Apple software (less so hardware, at least).
Not sure if this is even the right thing to modify but on a whim I decided to try this:
```diff
diff --git a/bench/echo_eio.ml b/bench/echo_eio.ml
index ae2dfb8..bd58be6 100644
--- a/bench/echo_eio.ml
+++ b/bench/echo_eio.ml
@@ -6,7 +6,7 @@ module Server = struct
   module Read = Eio.Buf_read
   let handle_client flow _addr =
-    let data = Read.of_flow flow ~max_size:128 in
+    let data = Read.of_flow flow ~max_size:(50*1024) in
     Eio.Flow.copy (Read.as_flow data) flow
   let run socket =
```
This gives me 280 Mbps (400 Mbps w/ unlocked CPU frequency) from `echo_eio`, which is still worse than the 570 Mbps everything else reaches. But the difference can at least now be explained by OCaml Eio inefficiency (which Atacama bypasses; or maybe this is again just bad benchmarking, I have never worked with Eio so it's hard to tell at a glance whether there are extra copies in there), instead of being an absurd gap.
I still have no idea how the reported results even make sense (other than the macOS theory which I can't test).
For what it's worth, the numbers by themselves are not that important. It's the numbers relative to the other languages/stacks that are important. So we can compare how Atacama is doing.
The Rust and Go ones were comically bad, so it's clear that something is likely wrong, and I think he's hoping the broader dev community can help make them better.
Also, the goal shouldn't be optimizing the Go or Rust (or any of them) to a fine degree, but writing each one in the way generally considered proper and standard for something like this.
This is not a benchmarking project, the benchmarking is just a tool used to help measure the progress and impact of the development work on Atacama.
> The Rust and Go ones were comically bad, so it's clear that something is likely wrong, and I think he's hoping the broader dev community can help make them better.
I'm not sure it was clear what was observed: what's wrong is on @leostera's end, as Atacama, Rust, Go (and even weird combinations like Rust with optimizations disabled) all have the same numbers on Linux.
On Linux, there's nothing really to optimize, outside of `echo_eio` (which mostly just seems to need a larger buffer size).
The problem is that @leostera's numbers both don't make sense (the benchmarks just saturate the OS TCP stack; there are no interesting differences between languages or even frameworks that I can think of) and have not been reproduced.
My best hope is that someone with access to a macOS machine can reproduce the bad numbers (or rather, the discrepancies in them) so we can figure out what the bad macOS syscalls are, or whatever pitfall is happening (could even be the default heap allocator for all I know).
Hi folks! Thanks for the thread. Let me clarify some stuff:
I'm running on a MacBook Pro M1 Max (10 cores according to htop/nproc) with 64GB of RAM.
Atacama (Riot) uses `kqueue` on macOS and `epoll` on Linux.
The goal is to illustrate the relative performance, but this can be made more explicit in the README.md too.
The lower-level syscalls between Eio and Riot (the scheduler Atacama builds on) should be roughly the same, but the scheduling between Riot and Eio is very different.
> @ryanwinchester: For what it's worth, the numbers by themselves are not that important. It's the numbers relative to the other languages/stacks that are important. So we can compare how Atacama is doing.
As @ryanwinchester said, this is just a way to compare how Atacama is doing. One of these days we'll do the work to submit to TechEmpower.
Also noticing the same thing here: on Linux, the performance delta is nowhere near as huge. Go performs roughly the same as the Elixir/Erlang implementations with your current benchmark code. If using `io.Copy(conn, conn)` instead of `conn.Read` / `conn.Write`, it's roughly 2-3x the performance of the Elixir implementation.
That's good tho! Feel free to PR the Go improvements 🙏🏼 – I don't want to misrepresent any platform here.
> I don't want to misrepresent any platform here.
The problem we were trying to communicate to you is that you already are misrepresenting (and not just one or two platforms, but almost all of them!).
Your relative distinctions have not been reproduced so far by anyone on Linux, where it appears that there is basically no difference between languages or frameworks, with OCaml 5's Eio being the only exception.
And even the Eio benchmark becomes only 2x slower (instead of 20x or worse) than everything else once you remember to update it to use 50 kB buffers instead of 128-byte buffers.
Best case scenario, you found a real problem on macOS: if so, you should show both macOS and Linux numbers (tho the Linux numbers are going to be really boringly equal to each other for the most part).
But it's also possible you had other software running at the same time on your macOS machine. Given that you don't seem to be using any advanced benchmarking techniques like core pinning, and presumably all benchmarks use all your cores, any background load will make your testing highly unscientific.
Thankfully, there is a way to check: run all the benchmarks again, in different orders, and discard any outliers because those must've been unrelated load on your macOS machine.
Okay, I did another run of everything after doing a few updates and these are the numbers I got:
| name | 100conn/10s | 100conn/60s |
|---|---|---|
| OCaml (Atacama) | 422.3 Mbps | 403.7 Mbps |
| OCaml (Eio) | 157.0 Mbps | 173.3 Mbps |
| Erlang (ranch) | 512.3 Mbps | 509.2 Mbps |
| Elixir (thousand_island) | 516.5 Mbps | 522.6 Mbps |
| Go (stdlib) | 199.8 Mbps | 219.0 Mbps |
| Rust (tokio) | 538.9 Mbps | 538.8 Mbps |
Tons of improvements across the board, but it looks like on my machine we won't get much more than ~540 Mbps. That may be saturating the network stack? Seems like a nice ceiling to work towards.
If any of you have all the Linux numbers, I'd be happy to add them there too! ✨
Leaving this open because I think this is a good discussion to have, as long as the focus is on improving the code in the benchmarks and collaborating to get a better idea of how Atacama performs on different machines.
And it goes without saying, but if you can improve on the Eio or Go code, just do it, or tell me how to do it and I'll get it done. Super happy to.
@eddyb did you see this? https://github.com/ocaml-multicore/eio/pull/663 🚀
I saw Go was handling ~200 Mbps of bandwidth and wanted to test it myself.

*Golang test result*

*Rust test result*

Both of them easily handle 1 Gbps.