googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Evaluate LTO, CGU=1, Profile-Guided Optimization (PGO) and LLVM BOLT #834

Open zamazan4ik opened 11 months ago

zamazan4ik commented 11 months ago

Hi!

I recently evaluated Profile-Guided Optimization (PGO) on multiple projects. The results are here. For example, PGO helps with optimizing Envoyproxy. According to these tests, PGO can improve performance in many other cases too, so I think trying to optimize Quilkin with PGO is a good idea.

Setting codegen units (CGU) to 1 and enabling LTO can also help Quilkin's performance thanks to potentially more aggressive inlining (and may reduce the binary size as well).

I can suggest the following action points:

Testing Post-Link Optimization techniques (like LLVM BOLT) may be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.

For Rust projects, I recommend starting PGO experiments with cargo-pgo.

Here are some examples of how PGO optimization is integrated in other projects:

I have already tried to run PGO tests on my machine but ran into a bug (more details in https://github.com/googleforgames/quilkin/issues/833). I think we can either wait for the fix or run the benchmark some other way (e.g. with iperf).

XAMPPRocky commented 11 months ago

Thank you for your issue! I definitely agree with adding it, as Quilkin is nearly entirely CPU bound from send_to and recv_from, so the more we can optimize the CPU time, the more clients a single proxy can handle. FWIW I've mostly been using fortio for benchmarking, mostly with flamegraphs, but you could also use perf. I've never managed to get iperf working because it requires UDP and TCP, whereas fortio only needs UDP.

Server

fortio udp-echo

Quilkin

cargo run --release -- proxy --to 127.0.0.1:8078

Client

fortio load -c 3000 -qps 1000000 udp://127.0.0.1:7777/

zamazan4ik commented 11 months ago

@XAMPPRocky I just tried your instructions above and on my Linux machines nothing happens - fortio does not generate the test load. Also, quilkin does not respond properly to CTRL+C in the terminal:

taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
2023-10-20T09:50:41.791435Z  INFO quilkin::cli: src/cli.rs: Starting Quilkin version="0.8.0-dev" commit="aeb2871bbfa7144cc007a10afa3300f1f6ae1815"
2023-10-20T09:50:41.791571Z  INFO quilkin::cli::admin: src/cli/admin.rs: Starting admin endpoint address=[::]:8000
2023-10-20T09:50:41.791741Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Starting port=7777 proxy_id="fedora"
2023-10-20T09:50:41.791830Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Quilkin is ready
^C2023-10-20T09:53:50.908715Z  INFO quilkin::cli: src/cli.rs: shutting down from signal signal=SIGINT
2023-10-20T09:53:50.908821Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: waiting for active sessions to expire sessions=996
^C^C^C^C^C^C^C[1]    163477 killed     taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'

And the only option to close it is SIGKILL. Fortio instances are started exactly as you wrote above. Did I just miss something obvious?

zamazan4ik commented 11 months ago

Oh, it seems to be just an overload issue (maybe too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :)

XAMPPRocky commented 11 months ago

Yeah, you need to adjust the -c to match your system, as it will try to spawn that many threads and sockets.

zamazan4ik commented 11 months ago

I performed some benchmarks and want to share my results.

Test environment

Benchmark setup

For benchmarking, I use the setup from https://github.com/googleforgames/quilkin/issues/834#issuecomment-1772183595 (suggested by @XAMPPRocky). The only addition on my side is using taskset to reduce the influence of OS thread scheduling. So the actual commands are:

The QPS is tuned so that Quilkin's CPU core is always at 100% utilization (so we can easily measure throughput improvements on the same hardware).

In this benchmark, I use 4 build configurations:

The Release build is done with cargo build --release; PGO builds are done with cargo-pgo. PGO profiles are collected from the benchmark workload itself. Unfortunately, Release + LTO + PGO optimized builds do not work due to a bug in Rustc (https://github.com/rust-lang/rust/issues/115344#issuecomment-1772573179); hopefully it will be fixed at some point.

All benchmarks are run multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course). Quilkin was restarted between runs. There is some variance between runs but it's not critical.

Results

For the build configurations:

I got the following results:

According to the tests, LTO and/or PGO can deliver improvements of several percent, at least in the benchmark above.

Binary sizes for all binaries with size command (just for reference):

size quilkin_release quilkin_lto quilkin_release_pgo_optimized quilkin_lto_instrumented quilkin_release_instrumented
    text     data    bss      dec     hex filename
20172458   838016   3664 21014138 140a67a quilkin_release
16134916   558568   3576 16697060  fec6e4 quilkin_lto
17604486   848424   3664 18456574 1199ffe quilkin_release_pgo_optimized
45767668 10730544  13288 56511500 35e4c0c quilkin_lto_instrumented
59404083 15691328  13376 75108787 47a11b3 quilkin_release_instrumented

I would also like to share some numbers on how enabling LTO and PGO affects build time:

Possible further steps

XAMPPRocky commented 11 months ago

Thank you for working on this @zamazan4ik! It's a shame we can't get both right now, is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

Are you interested in contributing the work to make this happen in our CI?

zamazan4ik commented 11 months ago

is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

I recommend enabling LTO (codegen-units=1 + lto = "fat" or ThinLTO) since it's much easier to integrate into the CI pipeline: it only requires enabling a few compiler flags. Compare that with PGO, where you need to implement a two-stage build pipeline. Later, when the LTO + PGO bug is fixed upstream, you can start integrating PGO as an additional optimization step after LTO.

Are you interested in contributing the work to make this happen in our CI?

If you agree to start with LTO, the changes in general would be as simple as the following change to the Cargo.toml file:

[profile.release]
lto = "fat"
codegen-units = 1

Since LTO (especially the Fat version) greatly increases build time (see my build time benchmarks above), you can enable LTO only for building actual releases, not for a usual CI build check. It's all up to you. To begin with, I recommend just adding these lines to Cargo.toml. Later, if you have any issues with build times or something like that, think about separating different profiles, etc.
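
For reference, the ThinLTO alternative mentioned above would differ only in the `lto` value. This is a sketch only; how it compares to fat LTO for Quilkin, in both throughput and build time, would need to be measured:

```toml
# Cargo.toml
# ThinLTO variant of the release profile (not benchmarked in this thread).
[profile.release]
lto = "thin"         # ThinLTO: usually cheaper to build than "fat"
codegen-units = 1    # single codegen unit, as in the fat-LTO suggestion
```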

markmandel commented 11 months ago

Thanks also for doing this work - this is super interesting, and great to see the performance improvements.

I would also like to share some numbers on how enabling LTO and PGO affects build time:

This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.

I've never managed to get iperf working because it requires UDP and TCP, whereas fortio only needs UDP.

Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use!

XAMPPRocky commented 11 months ago

Yeah, I think fortio is better. iperf has never worked for me, whether hosting a server locally or using a public one.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements is still fast. For CI, the extra build time is worth the better performance.

markmandel commented 11 months ago

Yeah, I think fortio is better. iperf has never worked for me, whether hosting a server locally or using a public one.

Agreed. #835 filed.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements is still fast. For CI, the extra build time is worth the better performance.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

https://github.com/googleforgames/quilkin/blob/aeb2871bbfa7144cc007a10afa3300f1f6ae1815/build/Makefile#L56-L59

zamazan4ik commented 11 months ago

This doesn't seem like a huge jump. Even for + LTO being 4 minutes -- that's not the end of the world. So definitely not a blocker.

Agreed. Just to highlight: some projects enable such "heavy" optimizations only when building actual release binaries. E.g. Vector implements this via a special release script. So if you decide to take that approach, there are already examples in the ecosystem to look at.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

Definitely! It's a good way to integrate PGO into the project.

markmandel commented 11 months ago

Definitely! It's a good way to integrate PGO into the project.

If you would love to show us how it's done 😃 @zamazan4ik - would definitely love your help in this area for sure. Seems like an easy win to me 👍🏻

zamazan4ik commented 11 months ago

Sure. You can create an additional LTO-specific profile in Cargo.toml, as is done in the G3 project, and then have the Makefile build Quilkin with that specific Cargo profile.
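
A minimal sketch of such a profile, assuming a custom profile named `release-lto` (a hypothetical name, not taken from Quilkin or G3). Cargo requires custom profiles to declare `inherits`; the LTO settings are the ones suggested earlier in this thread:

```toml
# Cargo.toml
# Hypothetical profile name; only release/image builds would select it,
# so a plain `cargo build --release` keeps its faster default settings.
[profile.release-lto]
inherits = "release"   # start from the stock release settings
lto = "fat"            # fat (whole-program) LTO
codegen-units = 1      # single codegen unit
```

The Makefile target could then presumably invoke `cargo build --profile release-lto`; Cargo places the output under target/release-lto rather than target/release, so any paths in the image build would need to follow.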