suri-framework / atacama

Modern, pure OCaml socket pool for Riot
MIT License

Benchmark #4

Closed TheYkk closed 6 months ago

TheYkk commented 10 months ago

I saw Go handling around 200 Mbps of bandwidth and wanted to test it myself.

root@alma-dev-env ~/dev-env/atacama/bench/rust-echo$ uname -a
Linux alma-dev-env 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 7 14:54:22 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
root@alma-dev-env ~/dev-env/atacama/bench/rust-echo$ go version
go version go1.21.5 linux/amd64


Golang test result

root@alma-dev-env ~/dev-env/atacama/bench$ tcpkali 0.0.0.0:2112 -m 'hello world' -c 100 -T 10s

Destination: [0.0.0.0]:2112
Interface lo address [127.0.0.1]:0
Using interface lo to connect to [0.0.0.0]:2112
Ramped up to 100 connections.
Total data sent:     62276.9 MiB (65302063033 bytes)
Total data received: 62323.5 MiB (65350918724 bytes)
Bandwidth per channel: 1044.599⇅ Mbps (130574.8 kBps)
Aggregate bandwidth: 52249.464↓, 52210.402↑ Mbps
Packet rate estimate: 4786058.3↓, 4536452.9↑ (12↓, 45↑ TCP MSS/op)

Rust test result

Destination: [0.0.0.0]:2112
Interface lo address [127.0.0.1]:0
Using interface lo to connect to [0.0.0.0]:2112
Ramped up to 100 connections.
Total data sent:     63535.5 MiB (66621801686 bytes)
Total data received: 63199.2 MiB (66269175265 bytes)
Bandwidth per channel: 1062.681⇅ Mbps (132835.2 kBps)
Aggregate bandwidth: 52993.073↓, 53275.056↑ Mbps
Packet rate estimate: 4861914.7↓, 4657961.4↑ (12↓, 43↑ TCP MSS/op)
Test duration: 10.0042 s.

Both of them easily handle over 1 Gbps per channel.
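
For context, the servers under test are plain TCP echo loops. The actual bench sources aren't quoted in this thread, but a naive Go echo server in the conn.Read / conn.Write style (a hypothetical sketch, not necessarily the repo's code) looks roughly like this:

package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":2112") // same port tcpkali targets above
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			// Buffer size matters a lot for echo throughput; a tiny
			// buffer forces many more syscalls per byte echoed.
			buf := make([]byte, 32*1024)
			for {
				n, err := c.Read(buf)
				if err != nil {
					return
				}
				if _, err := c.Write(buf[:n]); err != nil {
					return
				}
			}
		}(conn)
	}
}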

eddyb commented 10 months ago

I don't really understand what hardware or OS the benchmark results are from. On this machine (6C/12T Zen 2 @ 2.1 GHz) I get ~570 Mbps for Rust (both --release and unoptimized!), Go, and Atacama.

It even fails to scale when I unlock the CPU frequency, which leads me to believe this is benchmarking the Linux TCP stack and/or context-switching for me (and I guess you'd see throughput proportional to core count).

I did try dune exec bench/echo_eio.exe out of curiosity, and that managed only 16 Mbps (28 Mbps with unlocked CPU frequency), so maybe the benchmark was done on a machine with more cores, 8C/16T maybe?

But then the fact that none of the other results are better than my ~570 Mbps upper bound confuses me greatly. Is this an already-loaded system? I did this on a desktop machine and got shockingly repeatable numbers (as long as I wasn't actively using a browser or any other apps, etc.).

Was the benchmarking done on Windows or macOS? I'm not super familiar with their TCP stacks but I wouldn't be surprised if they were consistently (or, worse, inconsistently) slower than Linux (and that's before you add in scheduling, async APIs, etc.).

EDIT: oh, based on @leostera's Twitch stream, they dev on macOS, so unless the benchmark was done by SSH-ing into a Linux server, it must've been run on macOS, which I still haven't gotten around to setting up, because the only time I need it is for baffling bug reports unique to Apple software (less so hardware, at least).

eddyb commented 10 months ago

Not sure if this is even the right thing to modify, but on a whim I decided to try this:

diff --git a/bench/echo_eio.ml b/bench/echo_eio.ml
index ae2dfb8..bd58be6 100644
--- a/bench/echo_eio.ml
+++ b/bench/echo_eio.ml
@@ -6,7 +6,7 @@ module Server = struct
   module Read = Eio.Buf_read

   let handle_client flow _addr =
-    let data = Read.of_flow flow ~max_size:128 in
+    let data = Read.of_flow flow ~max_size:(50*1024) in
     Eio.Flow.copy (Read.as_flow data) flow

   let run socket =

This gives me 280 Mbps (400 Mbps with unlocked CPU frequency) from echo_eio, which is still worse than the 570 Mbps everything else reaches. But the difference can at least now be explained by OCaml Eio inefficiency (which Atacama bypasses; or maybe this is again just bad benchmarking, since I've never worked with Eio and it's hard to tell at a glance whether there are extra copies in there), instead of being an absurd gap.

I still have no idea how the reported results even make sense (other than the macOS theory which I can't test).

ryanwinchester commented 10 months ago

For what it's worth, the numbers by themselves are not that important. It's the numbers relative to the other languages/stacks that are important. So we can compare how Atacama is doing.

The Rust and Go ones were comically bad, so it's clear that something is likely wrong, and I think he's hoping the broader dev community can help make them better.

Also, the goal shouldn't be to optimize the Go or Rust (or any of them) to a fine degree, but to implement each in the way that is generally considered proper and standard for something like this.

This is not a benchmarking project, the benchmarking is just a tool used to help measure the progress and impact of the development work on Atacama.

eddyb commented 10 months ago

> The Rust and Go ones were comically bad, so it's clear that something is likely wrong, and I think he's hoping the broader dev community can help make them better.

I'm not sure the observation was clear: what's wrong is on @leostera's end, since Atacama, Rust, and Go (and even weird combinations like Rust with optimizations disabled) all produce the same numbers on Linux.

On Linux, there's nothing really to optimize, outside of echo_eio (which seems to mostly just be missing a reasonable buffer size).

The problem is that @leostera's numbers both don't make sense (the benchmarks just saturate the OS TCP stack; there are no interesting differences between languages or even frameworks that I can think of) and have not been reproduced.

My best hope is that someone with access to a macOS machine can reproduce the bad numbers (or rather, the discrepancies in them) so we can figure out what the bad macOS syscalls are, or whatever pitfall is happening (could even be the default heap allocator for all I know).

leostera commented 10 months ago

Hi folks! Thanks for the thread. Let me clarify some stuff:

> @ryanwinchester: For what it's worth, the numbers by themselves are not that important. It's the numbers relative to the other languages/stacks that are important. So we can compare how Atacama is doing.

As @ryanwinchester said, this is just a way to compare how Atacama is doing. One of these days we'll do the work to submit to TechEmpower.

Igneous commented 10 months ago

Also noticing the same thing here: on Linux, the performance delta is nowhere near as huge. Go performs roughly the same as the Elixir/Erlang implementations with your current benchmark code. Using io.Copy(conn, conn) instead of a conn.Read / conn.Write loop gives roughly 2-3x the performance of the Elixir implementation.
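
For illustration, the io.Copy variant could look like this (a sketch under the same assumptions as before; the listener setup and names are hypothetical, only the handler changes):

package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":2112")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			// io.Copy echoes with a single internal 32 kB buffer, and
			// between two TCP connections on Linux the runtime can use
			// splice(2), skipping the userspace copy of a manual
			// Read/Write loop entirely.
			io.Copy(c, c)
		}(conn)
	}
}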

leostera commented 10 months ago

That's good tho! Feel free to PR the Go improvements 🙏🏼 – I don't want to misrepresent any platform here.

eddyb commented 10 months ago

> I don't want to misrepresent any platform here.

The problem we were trying to communicate to you is that you already are misrepresenting them (and not just one or two platforms, but almost all of them!).

Your relative distinctions have not been reproduced so far by anyone on Linux, where it appears there is basically no difference between languages or frameworks, with OCaml 5's Eio being the only exception.

And even the Eio benchmark becomes only 2x slower than everything else (instead of 20x or worse) once it's updated to use 50 kB buffers instead of 128-byte buffers.

Best-case scenario, you found a real problem on macOS: if so, you should show both macOS and Linux numbers (though the Linux numbers are going to be boringly equal to each other for the most part).

But it's also possible you had other software running at the same time on your macOS machine: given you don't seem to be using any advanced benchmarking techniques like core pinning, and presumably every benchmark uses all your cores, any background load will make your testing highly unscientific.

Thankfully, there is a way to check: run all the benchmarks again, in different orders, and discard any outliers, since those must've been caused by unrelated load on your macOS machine.

leostera commented 10 months ago

Okay, I did another run of everything after doing a few updates and these are the numbers I got:

| name | 100conn/10s | 100conn/60s |
| --- | --- | --- |
| OCaml (Atacama) | 422.3 Mbps | 403.7 Mbps |
| OCaml (Eio) | 157.0 Mbps | 173.3 Mbps |
| Erlang (ranch) | 512.3 Mbps | 509.2 Mbps |
| Elixir (thousand_island) | 516.5 Mbps | 522.6 Mbps |
| Go (stdlib) | 199.8 Mbps | 219.0 Mbps |
| Rust (tokio) | 538.9 Mbps | 538.8 Mbps |

Tons of improvements across the board, but it looks like on my machine we won't get much more than ~540 Mbps. That may be saturating the network stack? Seems like a nice ceiling to work towards.

If any of you have all the Linux numbers, I'd be happy to add them there too! ✨

Leaving this open because I think this is a good discussion to have, as long as the focus is on improving the code in the benchmarks and collaborating to get a better idea of how Atacama performs on different machines.

leostera commented 10 months ago

And it goes without saying, but if you can improve on the Eio or Go code, just do it, or tell me how and I'll get it done. Super happy to.

leostera commented 10 months ago

@eddyb did you see this? https://github.com/ocaml-multicore/eio/pull/663 🚀