zamazan4ik commented 9 months ago

Hi!

Recently I started evaluating Profile-Guided Optimization (PGO) for different kinds of software; all my current results are available in my GitHub repo. Since PGO improves runtime efficiency in many cases, I decided to run some PGO tests on Lace. Here are my benchmark results.

Test environment

Fedora 39
Linux kernel 6.6.13
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.75
Lace version: master branch at commit 66e5a67688c76437a9ae5ec1bcadc4c1d0c7b604 (the latest at the time of testing)
Disabled Turbo boost (for more stable results across benchmark runs)

Benchmark

For benchmarking purposes, I use two things:

Built-in benchmarks
Manual lace-cli invocations with manual time measurements.

Built-in benchmarks are invoked with cargo bench --all-features --workspace. The PGO instrumentation phase on the benchmarks is done with cargo pgo bench -- --all-features --workspace. The PGO optimization phase is done with cargo pgo optimize bench -- --all-features --workspace.

For lace-cli, the Release build is done with cargo build --release, the PGO instrumented build with cargo pgo build, and the PGO optimized build with cargo pgo optimize build. The PGO training phase is done with LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace (see the "Results" section for more details about using different training sets and their impact on the actual performance numbers).

For lace-cli, I use taskset -c 0 to reduce the OS scheduler's impact on the results. The seed is fixed for the same reason.

All PGO optimization steps are done with the cargo-pgo tool.
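The three lace-cli phases described above (instrument, train, optimize) can be sketched as a small driver script. This is only an illustration: the function names `pgo_steps` and `run_steps` are hypothetical, the relative profile directory stands in for the absolute path used in the issue, and the optimize step is spelled `cargo pgo optimize build`, which is the subcommand form cargo-pgo provides. The `./lace_instrumented` name is taken from the commands above and presumably refers to a renamed copy of the instrumented binary; that copy step is not shown in the issue and is not reproduced here.

```python
import os
import shlex
import subprocess


def pgo_steps(profile_dir, dataset_csv, n_iters=100):
    """Return the three PGO phases as (extra_env, argv) pairs."""
    # %m and %p are expanded by the LLVM profiling runtime, so every
    # training process writes its own .profraw file.
    profile_pattern = f"{profile_dir}/lace_%m_%p.profraw"
    return [
        # 1. Build the instrumented binary.
        ({}, ["cargo", "pgo", "build"]),
        # 2. Train: run the instrumented binary on a representative workload.
        (
            {"LLVM_PROFILE_FILE": profile_pattern},
            ["./lace_instrumented", "run", "--csv", dataset_csv,
             "--n-iters", str(n_iters), "result.lace"],
        ),
        # 3. Rebuild using the collected profiles.
        ({}, ["cargo", "pgo", "optimize", "build"]),
    ]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it."""
    for extra_env, argv in steps:
        prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
        print((prefix + " " if prefix else "") + shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, env={**os.environ, **extra_env}, check=True)


if __name__ == "__main__":
    run_steps(pgo_steps("target/pgo-profiles",
                        "../resources/datasets/satellites/data.csv"))
```

With `dry_run=True` the script only prints the commands, which makes it easy to swap in a different training dataset before running anything.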

Results

First, here are the results for the built-in benchmarks:

Release: https://gist.github.com/zamazan4ik/d4bc743b2beb7e6f4bcf8c3c7fcab41b
PGO optimized compared to Release: https://gist.github.com/zamazan4ik/27734cb744ce2cd57e12ad8eda95e318
(just for reference) PGO instrumentation compared to Release (you can estimate the slowdown from the instrumentation phase): https://gist.github.com/zamazan4ik/446ac5486058cb3bb9a12c100a8c3e56

According to these benchmarks, PGO improves performance in many cases. However, as you can see, performance regresses in some cases. This can be expected, since the benchmarks cover different scenarios, and some scenarios can have "optimization conflicts": the same optimization decision can lead to an improvement in one scenario and a regression in another. That's why using benchmarks for the PGO training phase can be risky. Even so, we see many improvements.

To see a more realistic scenario, I also ran PGO benchmarks on lace-cli.

Release vs PGO optimized (trained on the satellites dataset) on the satellites dataset:

hyperfine --warmup 10 --min-runs 50 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.469 s ±  0.006 s    [User: 1.386 s, System: 0.063 s]
  Range (min … max):    1.464 s …  1.507 s    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.382 s ±  0.001 s    [User: 1.299 s, System: 0.064 s]
  Range (min … max):    1.380 s …  1.388 s    50 runs

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace ran
    1.06 ± 0.00 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace

Release vs PGO optimized (trained on the satellites dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     682.7 ms ±   3.6 ms    [User: 608.5 ms, System: 65.8 ms]
  Range (min … max):   680.4 ms … 706.4 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     652.4 ms ±   2.9 ms    [User: 579.8 ms, System: 64.3 ms]
  Range (min … max):   648.2 ms … 672.5 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.01 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
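To relate hyperfine's "times faster" figure to the share of wall-clock time saved, here is a small worked calculation using only the mean times reported above. The helper names are hypothetical, and the calculation ignores the run-to-run spread.

```python
def speedup(baseline_mean, candidate_mean):
    """How many times faster the candidate is than the baseline."""
    return baseline_mean / candidate_mean


def time_saved_percent(baseline_mean, candidate_mean):
    """Share of the baseline's wall-clock time that the candidate saves."""
    return 100.0 * (baseline_mean - candidate_mean) / baseline_mean


# Mean times in seconds (Release, PGO optimized), copied from the runs above.
runs = {
    "satellites": (1.469, 1.382),
    "animals": (0.6827, 0.6524),
}

for name, (release, optimized) in runs.items():
    print(f"{name}: {speedup(release, optimized):.3f}x faster, "
          f"{time_saved_percent(release, optimized):.1f}% less time")
```

The ratios round to the 1.06 and 1.05 that hyperfine reports, which corresponds to roughly 6% and 4% less wall-clock time respectively.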

Just for reference, here is the slowdown from PGO instrumentation:

hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     681.7 ms ±   0.7 ms    [User: 608.1 ms, System: 65.8 ms]
  Range (min … max):   681.0 ms … 683.1 ms    10 runs

Benchmark 2: taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     841.0 ms ±   4.7 ms    [User: 754.1 ms, System: 77.3 ms]
  Range (min … max):   835.2 ms … 853.1 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.23 ± 0.01 times faster than taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
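The "± 0.01" in the summary can be reproduced from the reported means and standard deviations. The sketch below assumes first-order error propagation for a ratio of two independent measurements; that this is how the figure is derived is an assumption on my part, although it does match the rounded values shown here. The function name is hypothetical.

```python
import math


def ratio_with_uncertainty(slow_mean, slow_sd, fast_mean, fast_sd):
    """Ratio slow/fast with its propagated standard deviation."""
    ratio = slow_mean / fast_mean
    # Relative errors of independent quantities add in quadrature.
    relative_sd = math.sqrt((slow_sd / slow_mean) ** 2 + (fast_sd / fast_mean) ** 2)
    return ratio, ratio * relative_sd


# Instrumented vs Release on the animals dataset (milliseconds).
ratio, sd = ratio_with_uncertainty(841.0, 4.7, 681.7, 0.7)
print(f"{ratio:.2f} ± {sd:.2f}")  # prints "1.23 ± 0.01"
```

In other words, the instrumented binary takes about 23% longer than the Release binary on this workload, and the uncertainty on that ratio is small compared to the effect.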

I decided to test one more thing: how much does performance differ if we use different PGO training sets? So here we go.

PGO optimized (trained on the satellites dataset) vs PGO optimized (trained on the animals dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     653.0 ms ±   1.4 ms    [User: 579.7 ms, System: 65.4 ms]
  Range (min … max):   649.4 ms … 655.9 ms    100 runs

Benchmark 2: taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     622.7 ms ±   1.8 ms    [User: 550.3 ms, System: 64.1 ms]
  Range (min … max):   618.6 ms … 626.3 ms    100 runs

Summary
  taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.00 times faster than taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

As you can see, the improvement is measurable (5% is a good improvement).
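Putting the three builds side by side on the animals dataset gives a rough picture of how much the choice of training set matters. This is a sketch only: the Release mean comes from an earlier hyperfine session than the two PGO means, so the comparison is approximate.

```python
# Mean times in milliseconds on the animals dataset, copied from the runs above.
means_ms = {
    "release": 682.7,
    "pgo (satellites profile)": 653.0,
    "pgo (animals profile)": 622.7,
}


def saved_vs(baseline_ms, mean_ms):
    """Percentage of the baseline's time saved by a build."""
    return 100.0 * (baseline_ms - mean_ms) / baseline_ms


baseline = means_ms["release"]
for build, mean in means_ms.items():
    print(f"{build:26s} {mean:6.1f} ms  "
          f"{saved_vs(baseline, mean):4.1f}% less time than release")
```

Under these numbers, training on a mismatched dataset saves roughly 4% of the Release time, while training on the dataset that is actually benchmarked saves roughly 9%.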

Summing up the results above, I can say that PGO helps Lace achieve better performance.

For anyone who cares about the binary size, I also did some measurements on lace-cli:

Release: 28184240 bytes
PGO optimized (animals dataset): 28085792 bytes
PGO optimized (satellites dataset): 27785576 bytes
PGO instrumented: 116176688 bytes
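For easier reading, the sizes above can be converted to MiB and expressed relative to the Release build. The snippet is plain arithmetic on the numbers listed, with no other inputs.

```python
# Binary sizes in bytes, copied from the list above.
sizes_bytes = {
    "Release": 28184240,
    "PGO optimized (animals dataset)": 28085792,
    "PGO optimized (satellites dataset)": 27785576,
    "PGO instrumented": 116176688,
}

release = sizes_bytes["Release"]
for build, size in sizes_bytes.items():
    print(f"{build}: {size / 2**20:.1f} MiB, {size / release:.3f}x Release")
```

The instrumented binary is about 4.1 times the size of the Release binary, while the two optimized binaries are slightly smaller than Release (by roughly 0.3% and 1.4%).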

Possible further steps

I can suggest the following things to consider:

Perform more PGO benchmarks on Lace. If they show improvements, add a note to the documentation about the possible performance gains from PGO (I guess somewhere in the README file will be enough).
Provide an easier way (e.g. a build option) to build Lace with PGO. This can be helpful for end users and maintainers, since they will be able to optimize Lace for their own workloads.
Optimize pre-built binaries (if any).

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with the usual LTO and PGO.

Here are some examples of how PGO optimization is integrated into other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

I would be happy to answer all your questions about PGO! Much more material about PGO (actual performance numbers across many other projects, the state of PGO across the ecosystem, PGO traps, and tricky details) can be found at https://github.com/zamazan4ik/awesome-pgo

schmidmt commented 9 months ago

Hi @zamazan4ik,

Thanks for bringing PGO to our attention as a way to improve performance.

While we have some experience with PGO, most of our experience is with algorithmic improvements to gain performance. Would you like to add a section to our mdbook outlining some of the methods you mentioned? We'd be happy to help with lace and what we've learned about how people use it.

Thanks again, we appreciate it.

zamazan4ik commented 9 months ago

> Would you like to add a section to our mdbook outlining some of the methods you mentioned?

Which mdbook do you mean exactly? I think I can contribute PGO-related information to it.

Swandog commented 9 months ago

> Which mdbook do you mean exactly? I think I can contribute PGO-related information to it.

Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book

zamazan4ik commented 9 months ago

> Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book

Thanks for the link! I think we can do something similar to what other projects have done. Some examples:

ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Vector: https://vector.dev/docs/administration/tuning/pgo/
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
GCC: Official docs, section "Building with profile feedback"
Clang:
- https://llvm.org/docs/HowToBuildWithPGO.html
- https://llvm.org/docs/AdvancedBuilds.html
Rustc: https://rustc-dev-guide.rust-lang.org/building/optimized-build.html#profile-guided-optimization
tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md

I need to think about it, and I may be able to create a PR for the book.

schmidmt commented 9 months ago

Thanks, @zamazan4ik; we appreciate the contribution.

promised-ai / lace

Profile-Guided Optimization (PGO) results #172
