zamazan4ik opened 11 months ago
Thank you for your issue! I definitely agree with adding it, as Quilkin is nearly entirely CPU bound in `send_to` and `recv_from`, so the more we can cut CPU time, the more clients a single proxy can handle. FWIW I've mostly been using fortio for benchmarking, mostly with flamegraphs, but you could also use perf. I've never managed to get iperf working because it requires UDP and TCP, whereas fortio only needs UDP.
```shell
fortio udp-echo
cargo run --release -- proxy --to 127.0.0.1:8078
fortio load -c 3000 -qps 1000000 udp://127.0.0.1:7777/
```
@XAMPPRocky I just tried your instructions above and on my Linux machines nothing happens - `fortio` does not generate the test load. Also, `quilkin` does not react properly to CTRL+C in the terminal:
```
taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
2023-10-20T09:50:41.791435Z INFO quilkin::cli: src/cli.rs: Starting Quilkin version="0.8.0-dev" commit="aeb2871bbfa7144cc007a10afa3300f1f6ae1815"
2023-10-20T09:50:41.791571Z INFO quilkin::cli::admin: src/cli/admin.rs: Starting admin endpoint address=[::]:8000
2023-10-20T09:50:41.791741Z INFO quilkin::cli::proxy: src/cli/proxy.rs: Starting port=7777 proxy_id="fedora"
2023-10-20T09:50:41.791830Z INFO quilkin::cli::proxy: src/cli/proxy.rs: Quilkin is ready
^C2023-10-20T09:53:50.908715Z INFO quilkin::cli: src/cli.rs: shutting down from signal signal=SIGINT
2023-10-20T09:53:50.908821Z INFO quilkin::cli::proxy: src/cli/proxy.rs: waiting for active sessions to expire sessions=996
^C^C^C^C^C^C^C[1] 163477 killed taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
```
And the only option to close it is SIGKILL. Fortio instances are started exactly as you wrote above. Did I just miss something obvious?
Oh, it seems to just be an overload issue (maybe too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :)
Yeah, you need to adjust the `-c` to match your system, as it will try to spawn that many threads and sockets.
I performed some benchmarks and want to share my results.
`main` branch at commit `aeb2871bbfa7144cc007a10afa3300f1f6ae1815`
For benchmarking purposes, I use the setup from https://github.com/googleforgames/quilkin/issues/834#issuecomment-1772183595 (suggested by @XAMPPRocky). The only addition from my side is using `taskset` to reduce the influence of OS thread scheduling. So the actual commands are:
- `taskset -c 23 fortio udp-echo` - Server
- `taskset -c 0 quilkin proxy --to '127.0.0.1:8078'` - Quilkin
- `taskset -c 11-12 fortio load -c 300 -qps 80000 -t 120s udp://127.0.0.1:7777/` - Client

The QPS amount is tweaked to make sure that Quilkin's CPU core is always at 100% (so we can easily measure the throughput improvements on the same hardware).
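For repeated runs, the three commands above could be wrapped in a small script. This is a sketch only: it assumes `fortio` and `quilkin` are on `PATH`, that the core IDs match the machine described above, and that a 2-second startup delay is enough.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pin each process to its own core(s), as in the setup above.
taskset -c 23 fortio udp-echo &
ECHO_PID=$!
taskset -c 0 quilkin proxy --to '127.0.0.1:8078' &
PROXY_PID=$!
sleep 2  # give the echo server and the proxy time to start

# The load generator prints the latency/QPS report to stdout.
taskset -c 11-12 fortio load -c 300 -qps 80000 -t 120s udp://127.0.0.1:7777/

kill "$PROXY_PID" "$ECHO_PID"
```

Restarting `quilkin` between runs (as done in the benchmarks below) just means re-running this script per configuration.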
In this benchmark, I use 4 build configurations:

- Release build
- Release + PGO build
- Release + `codegen-units=1` + `lto = fat` build
- Release + `codegen-units=1` + `lto = fat` + PGO build

Release builds are done with `cargo build --release`; PGO builds are done with cargo-pgo. PGO profiles are collected from the benchmark workload itself. Unfortunately, Release + LTO + PGO optimized builds do not work due to a bug in rustc (https://github.com/rust-lang/rust/issues/115344#issuecomment-1772573179) - hopefully it will be fixed at some point.
All benchmarks were done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course). Between each run, `quilkin` was restarted. There is some variance between runs, but it's not critical.
For the build configurations:

- `quilkin_release` - Release build
- `quilkin_lto` - Release + `codegen-units=1` + `lto = fat` build
- `quilkin_release_pgo_optimized` - Release + PGO optimized build
- `quilkin_lto_instrumented` - Release + `codegen-units=1` + `lto = fat` + PGO instrumentation
- `quilkin_release_instrumented` - Release + PGO instrumentation

I got the following results:
- `quilkin_release`: https://gist.github.com/zamazan4ik/77d6272d0ae80f823ee92526fe3df418
- `quilkin_lto`: https://gist.github.com/zamazan4ik/b153cb2d61d6410721fda843b72a9ee3
- `quilkin_release_pgo_optimized`: https://gist.github.com/zamazan4ik/be0dc2d58e1e753ff4838846a137a1ec
- `quilkin_lto_instrumented`: https://gist.github.com/zamazan4ik/a6b1a29b516e5e67122ba6c1fe0c4f3f
- `quilkin_release_instrumented`: https://gist.github.com/zamazan4ik/56ecf4fdc1de189bffb6f296eb27d901

According to the tests, it's possible to achieve improvements of several percent with LTO and/or PGO, at least in the benchmark above.
Binary sizes for all binaries, measured with the `size` command (just for reference):
```
$ size quilkin_release quilkin_lto quilkin_release_pgo_optimized quilkin_lto_instrumented quilkin_release_instrumented
    text     data    bss      dec       hex filename
20172458   838016   3664 21014138   140a67a quilkin_release
16134916   558568   3576 16697060    fec6e4 quilkin_lto
17604486   848424   3664 18456574   1199ffe quilkin_release_pgo_optimized
45767668 10730544  13288 56511500   35e4c0c quilkin_lto_instrumented
59404083 15691328  13376 75108787   47a11b3 quilkin_release_instrumented
```
Also, I would share some numbers about enabling LTO and PGO and its impact on the build time (cargo-pgo as well).

Thank you for working on this @zamazan4ik! It's a shame we can't get both right now - is there one in particular that you'd recommend we adopt while we wait for it to be fixed?
Are you interested in contributing the work to make this happen in our CI?
> is there one in particular that you'd recommend we adopt while we wait for it to be fixed?
I recommend enabling LTO (`codegen-units=1` + `lto = "fat"`, or ThinLTO), since it's much easier to integrate into the CI pipeline - it's just a matter of enabling a few compiler flags. Compare that to PGO, where you need to implement a 2-stage build pipeline. Later, when the LTO + PGO bug is fixed upstream, you can integrate PGO as an additional optimization step on top of LTO.
> Are you interested in contributing the work to make this happen in our CI?
If you agree to start with LTO, the changes would in general be as simple as the following addition to the `Cargo.toml` file:

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```
Since LTO (especially the fat version) greatly slows down the build (see my build time benchmarks above), you can enable LTO only for building actual releases rather than on every CI build check - it's all up to you. I recommend just putting these lines into `Cargo.toml` at the beginning, and later, if you hit issues with build times or anything like that, thinking about separating different profiles, etc.
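One way to separate the profiles is a custom Cargo profile, which Cargo supports via `inherits` (stable since Rust 1.57). This is a sketch; the profile name `release-lto` is just an example:

```toml
# `cargo build --release` stays fast for local iteration;
# CI builds the slower, fully optimized binary with:
#   cargo build --profile release-lto
[profile.release-lto]
inherits = "release"
lto = "fat"
codegen-units = 1
```

The resulting binary lands in `target/release-lto/` instead of `target/release/`, so release packaging would need to pick it up from there.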
Thanks also for doing this work - this is super interesting, and great to see the performance improvements.
> Also, I would share some numbers about enabling LTO and PGO and its impact on the build time:
This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.
> I've never managed to get iperf working because it requires UDP and TCP, whereas fortio only needs UDP.
Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use!
Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one.
Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements stays fast. For CI, the extra time is worth the better performance.
> Yeah, I think fortio is better, it's never worked for me for locally hosting an iperf server or using a public server.
Agreed. #835 filed.
> Re: the release flags, I think I would lean towards enabling them only CI, so that running benchmarks locally and iterating on improvements is still fast. For CI extra time is worth better performance.
Yeah, that makes sense - we could add the optimisation when building out the images via the Makefile (links below), which hooks into CI, but keep it off for a local `cargo build`. @zamazan4ik I assume that's possible?
> This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.
Agreed. Just to highlight: some projects enable such "heavy" optimizations only for building the actual release binaries. E.g. Vector implements it via a special release script. So if you decide to implement such an approach, there are already examples in the ecosystem to look at.
> Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?
Definitely! It's a good way to integrate PGO into the project.
> Definitely! It's a good way to integrate PGO into the project.
If you would love to show us how it's done 😃 @zamazan4ik - would definitely love your help in this area for sure. Seems like an easy win to me 👍🏻
Sure. You can create an additional LTO-specific profile in `Cargo.toml`, as is done in the G3 project, and then have the Makefile build Quilkin with that specific Cargo profile.
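The Makefile side of that could look roughly like this. It's a sketch: it assumes a custom Cargo profile with LTO enabled already exists in `Cargo.toml` (here called `release-lto`, a hypothetical name), and `build-image` is a placeholder for whatever target Quilkin's Makefile actually uses for image builds.

```make
# Hypothetical target: CI/image builds use the slow, fully optimized profile,
# while plain `cargo build --release` stays untouched for local development.
build-image:
	cargo build --profile release-lto
```

Custom profiles are selected with `cargo build --profile <name>`, so nothing about the default `release` profile changes for contributors.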
Hi!
Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. E.g. PGO helps with optimizing Envoyproxy. According to multiple tests, PGO can help improve performance in many other cases too. That's why I think trying to optimize Quilkin with PGO could be a good idea.
Setting codegen units (CGU) to 1 and enabling LTO can also help optimize Quilkin's performance due to possibly more aggressive inlining (and could help reduce the binary size).
I can suggest the following action points:
Maybe testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with the usual PGO.
For Rust projects, I recommend starting to experiment with PGO via cargo-pgo.
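With cargo-pgo, the 2-stage pipeline looks roughly like the following. This is a sketch under assumptions: cargo-pgo and the `llvm-tools-preview` rustup component are installed, and the training workload is a representative benchmark (e.g. the fortio load test discussed in this thread).

```shell
# One-time setup: cargo-pgo needs llvm-profdata from llvm-tools.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# Stage 1: build an instrumented binary that writes .profraw profiles.
cargo pgo build

# Run the representative workload against the instrumented binary here
# (e.g. proxy traffic through it with fortio load).

# Stage 2: rebuild using the collected profiles.
cargo pgo optimize
```

The instrumented binary is noticeably larger and slower (see the `size` numbers later in the thread), which is expected; only the stage-2 binary is meant to ship.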
Here are some examples of how PGO optimization is integrated into other projects (e.g. via a `configure` script).
I have already tried to perform PGO tests on my machine but hit a bug (more details in https://github.com/googleforgames/quilkin/issues/833). I think we can wait for the fix or run the benchmark some other way (e.g. with `iperf`).