Evaluate LTO, CGU=1, Profile-Guided Optimization (PGO) and LLVM BOLT

zamazan4ik commented 11 months ago

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. E.g. PGO helps with optimizing Envoyproxy. According to multiple tests, PGO can help improve performance in many other cases. That's why I think trying to optimize Quilkin with PGO is a good idea.

Setting codegen units (CGU) to 1 and enabling LTO can also help optimize Quilkin's performance through potentially more aggressive inlining (and could help reduce the binary size).

I can suggest the following action points:

Perform PGO benchmarks on Quilkin. If they show improvements, add a note about the possible improvements in Quilkin performance with PGO.
Provide an easier way (e.g. a build option) to build Quilkin with PGO. This can be helpful for end users and maintainers, since they will be able to optimize Quilkin for their own workloads.
Optimize the pre-built binaries.

Testing post-link optimization techniques (like LLVM BOLT) could be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.

For Rust projects, I recommend starting to experiment with PGO using cargo-pgo.
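As a rough illustration of what a cargo-pgo cycle looks like, here is a minimal sketch. The `cargo pgo build` and `cargo pgo optimize` subcommands come from cargo-pgo; the `pgo_build` helper and the idea of passing the workload as a command are assumptions made for this example, not something Quilkin provides.

```shell
# One-time setup (run manually):
#   cargo install cargo-pgo
#   rustup component add llvm-tools-preview

# Hypothetical helper: run a full PGO cycle with cargo-pgo.
# Usage: pgo_build <workload command...>
# The workload command is supplied by the caller. It should start the
# instrumented binary, drive load through it, and shut it down cleanly,
# because the profile data is written when the process exits normally.
pgo_build() {
    cargo pgo build &&      # 1. build with PGO instrumentation
    "$@" &&                 # 2. run the caller-supplied workload
    cargo pgo optimize      # 3. rebuild using the collected profiles
}
```

It could be invoked as `pgo_build ./run-benchmark.sh`, where `run-benchmark.sh` is a placeholder for whatever script exercises the instrumented binary. cargo-pgo typically places its artifacts under `target/<target-triple>/release/` rather than `target/release/`.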

Here are some examples of how PGO optimization is integrated in other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

I have already tried to perform PGO tests on my machine but ran into a bug (more details in https://github.com/googleforgames/quilkin/issues/833). I think we can wait for the fix or run the benchmark some other way (e.g. with iperf).

XAMPPRocky commented 11 months ago

Thank you for your issue! I definitely agree with adding it, as Quilkin is almost entirely CPU bound from send_to and recv_from, so the more we can optimize the CPU time, the more clients a single proxy can handle. FWIW I've mostly been using fortio for benchmarking. Mostly with flamegraphs but you could also use perf. I've never managed to get iperf working because it requires UDP and TCP whereas fortio only needs UDP.

Server

fortio udp-echo

Quilkin

cargo run --release -- proxy --to 127.0.0.1:8078

Client

fortio load -c 3000 -qps 1000000 udp://127.0.0.1:7777/

zamazan4ik commented 11 months ago

@XAMPPRocky I just tried your instructions above and on my Linux machines nothing happens: fortio does not generate the test load. Also, quilkin does not react properly to Ctrl+C in the terminal:

taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
2023-10-20T09:50:41.791435Z  INFO quilkin::cli: src/cli.rs: Starting Quilkin version="0.8.0-dev" commit="aeb2871bbfa7144cc007a10afa3300f1f6ae1815"
2023-10-20T09:50:41.791571Z  INFO quilkin::cli::admin: src/cli/admin.rs: Starting admin endpoint address=[::]:8000
2023-10-20T09:50:41.791741Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Starting port=7777 proxy_id="fedora"
2023-10-20T09:50:41.791830Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Quilkin is ready
^C2023-10-20T09:53:50.908715Z  INFO quilkin::cli: src/cli.rs: shutting down from signal signal=SIGINT
2023-10-20T09:53:50.908821Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: waiting for active sessions to expire sessions=996
^C^C^C^C^C^C^C[1]    163477 killed     taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'

And the only option to close it is SIGKILL. Fortio instances are started exactly as you wrote above. Did I just miss something obvious?

zamazan4ik commented 11 months ago

Oh, it seems to be just an overloading issue (maybe too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :)

XAMPPRocky commented 11 months ago

Yeah, you need to adjust the -c to match your system, as it will try to spawn that many threads and sockets.

zamazan4ik commented 11 months ago

I performed some benchmarks and want to share my results.

Test environment

Fedora 38
Linux kernel 6.5.6
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.73
Quilkin version: the latest from the main branch at the time of testing, commit aeb2871bbfa7144cc007a10afa3300f1f6ae1815
Disabled Turbo boost

Benchmark setup

For benchmarking purposes, I use the setup from https://github.com/googleforgames/quilkin/issues/834#issuecomment-1772183595 (suggested by @XAMPPRocky). The only addition from my side is using taskset to reduce the influence of the OS thread scheduling. So the actual commands are:

taskset -c 23 fortio udp-echo - Server
taskset -c 0 quilkin proxy --to '127.0.0.1:8078' - Quilkin
taskset -c 11-12 fortio load -c 300 -qps 80000 -t 120s udp://127.0.0.1:7777/ - Client

The QPS value is tuned so that Quilkin's CPU core is always at 100% load (so we can easily measure throughput improvements on the same hardware).

In this benchmark, I use 4 build configurations:

Release build
Release + codegen-units=1 + lto = fat build
Release + PGO build
Release + codegen-units=1 + lto = fat + PGO build

The Release build is done with cargo build --release; PGO builds are done with cargo-pgo. PGO profiles are collected from the benchmark workload itself. Unfortunately, Release + LTO + PGO optimized builds do not work due to a bug in Rustc (https://github.com/rust-lang/rust/issues/115344#issuecomment-1772573179); hopefully it will be fixed at some point.

All benchmarks are done multiple times, on the same hardware/software setup, with the same background "noise" (as far as I can guarantee, of course). Quilkin was restarted between runs. There is some variance between runs, but it's not critical.

Results

For the build configurations:

quilkin_release - Release build
quilkin_lto - Release + codegen-units=1 + lto = fat build
quilkin_release_pgo_optimized - Release + PGO optimized build
quilkin_lto_instrumented - Release + codegen-units=1 + lto = fat + PGO instrumentation
quilkin_release_instrumented - Release + PGO instrumentation

I got the following results:

quilkin_release: https://gist.github.com/zamazan4ik/77d6272d0ae80f823ee92526fe3df418
quilkin_lto: https://gist.github.com/zamazan4ik/b153cb2d61d6410721fda843b72a9ee3
quilkin_release_pgo_optimized: https://gist.github.com/zamazan4ik/be0dc2d58e1e753ff4838846a137a1ec
quilkin_lto_instrumented: https://gist.github.com/zamazan4ik/a6b1a29b516e5e67122ba6c1fe0c4f3f
quilkin_release_instrumented: https://gist.github.com/zamazan4ik/56ecf4fdc1de189bffb6f296eb27d901

According to the tests, it's possible to achieve improvements of several percent with LTO and/or PGO, at least in the benchmark above.

Binary sizes for all binaries, as reported by the size command (just for reference):

size quilkin_release quilkin_lto quilkin_release_pgo_optimized quilkin_lto_instrumented quilkin_release_instrumented
    text      data    bss       dec      hex filename
20172458    838016   3664  21014138  140a67a quilkin_release
16134916    558568   3576  16697060   fec6e4 quilkin_lto
17604486    848424   3664  18456574  1199ffe quilkin_release_pgo_optimized
45767668  10730544  13288  56511500  35e4c0c quilkin_lto_instrumented
59404083  15691328  13376  75108787  47a11b3 quilkin_release_instrumented
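To put the table in relative terms, the text-segment sizes can be compared against the plain release build. This is a small sketch using awk; the helper name `pct_change` is made up for illustration, and the numbers are copied from the `text` column above.

```shell
# pct_change OLD NEW: print the relative change from OLD to NEW, in percent.
pct_change() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%+.1f\n", (new - old) / old * 100 }'
}

release_text=20172458                   # quilkin_release
pct_change "$release_text" 16134916     # quilkin_lto                   -> -20.0
pct_change "$release_text" 17604486     # quilkin_release_pgo_optimized -> -12.7
```

In this measurement, the LTO binary's text segment is about 20% smaller than the release build's, and the PGO-optimized one about 13% smaller.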

Also, I would share some numbers about enabling LTO and PGO and its impact on the build time:

Build time Quilkin Release: 1m 07s
Build time Quilkin Release + LTO: 4m 28s
Build time Quilkin Release + LTO + PGO Instrumentation: 6m 45s
Build time Quilkin Release + PGO Instrumented: 1m 16s
Build time Quilkin Release + PGO optimized: 53.49s

Possible further steps

Test LLVM BOLT applicability for Quilkin (can be done with cargo-pgo as well).

XAMPPRocky commented 11 months ago

Thank you for working on this @zamazan4ik! It's a shame we can't get both right now, is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

Are you interested in contributing the work to make this happen in our CI?

zamazan4ik commented 11 months ago

is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

I recommend enabling LTO (codegen-units=1 + lto = "fat" or ThinLTO) since it's much easier to integrate into the CI pipeline: it only requires enabling a few compiler flags. Compare that with PGO, where you need to implement a two-stage build pipeline. Later, once the LTO + PGO bug is fixed upstream, you can start integrating PGO as an additional optimization step after LTO.

Are you interested in contributing the work to make this happen in our CI?

If you agree to start with LTO, the changes in general would be as simple as the following change to the Cargo.toml file:

[profile.release]
lto = "fat"
codegen-units = 1

Since LTO (especially the fat version) greatly slows down the build (see my build time benchmarks above), you can enable LTO only when building actual releases, not on a usual CI build check. It's all up to you. I recommend simply adding these lines to Cargo.toml at first. Later, if you have any issues with build times or something like that, think about separating different profiles, etc.
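One way to enable LTO only for release builds, without touching Cargo.toml, is to override the profile through Cargo's `CARGO_PROFILE_<NAME>_<KEY>` environment variables in the release job. This is a minimal sketch, assuming the release build is driven from a shell step; the function name `build_release_lto` is hypothetical.

```shell
# Build the release profile with fat LTO and a single codegen unit.
# The overrides apply only to this invocation, so a plain
# `cargo build --release` elsewhere keeps the default (faster) settings.
build_release_lto() {
    CARGO_PROFILE_RELEASE_LTO=fat \
    CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
        cargo build --release "$@"
}
```

Any extra arguments are passed through to cargo, e.g. `build_release_lto --locked`.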

markmandel commented 11 months ago

Thanks also for doing this work - this is super interesting, and great to see the performance improvements.

Also, I would share some numbers about enabling LTO and PGO and its impact on the build time:

This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.

I've never managed to get iperf working because it requires UDP and TCP whereas fortio only needs UDP.

Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use!

XAMPPRocky commented 11 months ago

Yeah, I think fortio is better, it's never worked for me for locally hosting an iperf server or using a public server.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements is still fast. For CI, the extra time is worth the better performance.

markmandel commented 11 months ago

Yeah, I think fortio is better, it's never worked for me for locally hosting an iperf server or using a public server.

Agreed. #835 filed.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements is still fast. For CI, the extra time is worth the better performance.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

https://github.com/googleforgames/quilkin/blob/aeb2871bbfa7144cc007a10afa3300f1f6ae1815/build/Makefile#L56-L59

zamazan4ik commented 11 months ago

This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.

Agreed. Just to highlight: some projects enable such "heavy" optimizations only when building actual release binaries. E.g. Vector implements this via a special release script. So if you decide to take this approach, there are already examples in the ecosystem to look at.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

Definitely! It's a good way to integrate PGO into the project.

markmandel commented 11 months ago

Definitely! It's a good way to integrate PGO into the project.

If you would love to show us how it's done 😃 @zamazan4ik - would definitely love your help in this area for sure. Seems like an easy win to me 👍🏻

zamazan4ik commented 11 months ago

Sure. You can create an additional LTO-specific profile in Cargo.toml, as is done in the G3 project. Then, from the Makefile, just build Quilkin with that specific Cargo profile.
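A sketch of how that could look, assuming a custom profile is added to Cargo.toml. The profile name `release-lto` and the function name `build_optimized` are hypothetical and not taken from Quilkin or G3.

```shell
# Assumed addition to Cargo.toml (shown here as a comment):
#
#   [profile.release-lto]
#   inherits = "release"
#   lto = "fat"
#   codegen-units = 1

# Build with the custom profile, e.g. from a Makefile target.
# Cargo writes the artifacts for a custom profile to target/<profile-name>/,
# so the binary would land in target/release-lto/.
build_optimized() {
    cargo build --profile release-lto "$@"
}
```

A local `cargo build --release` is unaffected by the extra profile, so iteration speed stays the same.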
