promised-ai / lace

A probabilistic ML tool for science

Profile-Guided Optimization (PGO) results #172

Open zamazan4ik opened 9 months ago

zamazan4ik commented 9 months ago

Hi!

Recently I started evaluating Profile-Guided Optimization (PGO) for optimizing different kinds of software - all my current results are available in my GitHub repo. Since PGO often helps achieve better runtime efficiency, I decided to run some PGO tests on Lace. I performed some benchmarks and want to share my results here.

Test environment

Benchmark

For benchmarking purposes, I use two things:

1. Built-in benchmarks, invoked with cargo bench --all-features --workspace. The PGO instrumentation phase on the benchmarks is done with cargo pgo bench -- --all-features --workspace, and the PGO optimization phase with cargo pgo optimize bench -- --all-features --workspace.

2. lace-cli. The Release build is done with cargo build --release, the PGO instrumented build with cargo pgo build, and the PGO optimized build with cargo pgo optimize build. The PGO training phase is done with LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace (see the "Results" section for more details about using different training sets and their impact on the actual performance numbers).

For lace-cli, I use taskset -c 0 to reduce the impact of the OS scheduler on the results. The seed is fixed for the same reason.

All PGO optimization steps are done with the cargo-pgo tool.
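
For reference, here is the whole workflow condensed into one sketch. The commands are the ones listed above plus cargo install cargo-pgo; ./lace_instrumented stands for a renamed copy of the instrumented build output (its exact location under target/ depends on the target triple), and the profile path is shortened to a relative one.

# Install the helper tool once
cargo install cargo-pgo

# Built-in benchmarks: instrumented run to collect profiles, then the optimized rebuild
cargo pgo bench -- --all-features --workspace
cargo pgo optimize bench -- --all-features --workspace

# lace-cli: instrumented build, a training run on a representative dataset, optimized build
cargo pgo build
LLVM_PROFILE_FILE=target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
cargo pgo optimize build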

Results

First, here are the results for the built-in benchmarks:

According to these benchmarks, PGO improves performance in many cases. However, as you can see, performance regresses in some cases. This is expected: the benchmarks cover different scenarios, and those scenarios can have "optimization conflicts", where the same optimization decision improves one scenario and regresses another. That's why using benchmarks as the PGO training workload can be risky. Even so, we still see many improvements.

To get a more realistic scenario, I also ran PGO benchmarks on lace-cli.

Release vs PGO optimized (trained on the satellites dataset) on the satellites dataset:

hyperfine --warmup 10 --min-runs 50 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.469 s ±  0.006 s    [User: 1.386 s, System: 0.063 s]
  Range (min … max):    1.464 s …  1.507 s    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
  Time (mean ± σ):      1.382 s ±  0.001 s    [User: 1.299 s, System: 0.064 s]
  Range (min … max):    1.380 s …  1.388 s    50 runs

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace ran
    1.06 ± 0.00 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace

Release vs PGO optimized (trained on the satellites dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     682.7 ms ±   3.6 ms    [User: 608.5 ms, System: 65.8 ms]
  Range (min … max):   680.4 ms … 706.4 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     652.4 ms ±   2.9 ms    [User: 579.8 ms, System: 64.3 ms]
  Range (min … max):   648.2 ms … 672.5 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_optimized run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.01 times faster than taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

Just for reference, here is the slowdown from PGO instrumentation:

hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     681.7 ms ±   0.7 ms    [User: 608.1 ms, System: 65.8 ms]
  Range (min … max):   681.0 ms … 683.1 ms    10 runs

Benchmark 2: taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     841.0 ms ±   4.7 ms    [User: 754.1 ms, System: 77.3 ms]
  Range (min … max):   835.2 ms … 853.1 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./lace_release run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.23 ± 0.01 times faster than taskset -c 0 ./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

I decided to test one more thing - how much does performance differ when different PGO training sets are used? So here we go.

PGO optimized (trained on the satellites dataset) vs PGO optimized (trained on the animals dataset) on the animals dataset:

hyperfine --warmup 30 --min-runs 100 'taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace' 'taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace'
Benchmark 1: taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     653.0 ms ±   1.4 ms    [User: 579.7 ms, System: 65.4 ms]
  Range (min … max):   649.4 ms … 655.9 ms    100 runs

Benchmark 2: taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
  Time (mean ± σ):     622.7 ms ±   1.8 ms    [User: 550.3 ms, System: 64.1 ms]
  Range (min … max):   618.6 ms … 626.3 ms    100 runs

Summary
  taskset -c 0 ./lace_optimized_animals run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace ran
    1.05 ± 0.00 times faster than taskset -c 0 ./lace_optimized_satellites run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace

As you can see, the difference is measurable (5% is a solid improvement).
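
For completeness, here is a rough sketch of how such a pair of differently-trained binaries can be produced. The only assumptions are that target/pgo-profiles is where the raw profiles land (as in the LLVM_PROFILE_FILE path above) and that the resulting binaries are copied out under the names used in this comment.

# Cycle 1: train on the satellites dataset
cargo pgo build
./lace_instrumented run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
cargo pgo optimize build
# copy the optimized binary out as lace_optimized_satellites

# Cycle 2: drop the old profiles and train on the animals dataset
rm -rf target/pgo-profiles
cargo pgo build
./lace_instrumented run --seed 42 --csv ../resources/datasets/animals/data.csv --n-iters 100 result.lace
cargo pgo optimize build
# copy the optimized binary out as lace_optimized_animals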

To conclude from all the results above: PGO helps achieve better performance with Lace.

For anyone who cares about the binary size, I also did some measurements on lace-cli:

Possible further steps

I can suggest the following things to consider:

Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT as an addition to PGO), but I recommend starting with the usual LTO and PGO; see the sketch below.
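
If someone wants to experiment with BOLT on top of PGO later, cargo-pgo has BOLT support as well. The sketch below follows its bolt subcommands as far as I know them, so treat it as a starting point rather than a recipe; lace_bolt_instrumented is just a placeholder name for the renamed build output. Regular thin/fat LTO is enabled separately via the release profile in Cargo.toml.

# Build a BOLT-instrumented binary that already uses the collected PGO profiles
cargo pgo bolt build --with-pgo
# Run the same training workload to collect BOLT profiles
./lace_bolt_instrumented run --seed 42 --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace
# Produce the final PGO + BOLT optimized binary
cargo pgo bolt optimize --with-pgo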

Here are some examples of how PGO optimization is integrated into other projects:

I would be happy to answer any questions about PGO! Much more material about PGO (actual performance numbers across many other projects, the state of PGO across the ecosystem, PGO traps, and tricky details) can be found at https://github.com/zamazan4ik/awesome-pgo

schmidmt commented 9 months ago

Hi @zamazan4ik,

Thanks for bringing PGO to our attention as a way to improve performance.

While we have some experience with PGO, most of our experience with performance gains comes from algorithmic improvements. Would you like to add a section to our mdbook outlining some of the methods you mentioned? We'd be happy to help with the Lace side and share what we've learned about how people use it.

Thanks again, we appreciate it.

zamazan4ik commented 9 months ago

Would you like to add a section to our mdbook outlining some of the methods you mentioned?

Which mdbook exactly do you mean? I think I could contribute some PGO-related information to it.

Swandog commented 9 months ago

Which mdbook exactly do you mean? I think I could contribute some PGO-related information to it.

Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book

zamazan4ik commented 9 months ago

Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book

Thanks for the link! I think we can do something similar to what's done for other projects. Some examples:

I need to think about it, and maybe I'll be able to create a PR for the book.
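
For reference, a minimal sketch of what adding such a chapter could look like, assuming a standard mdbook layout under book/ (the file name and the SUMMARY.md placement are hypothetical and entirely up to the maintainers):

cd book
# hypothetical new chapter file
printf '# Profile-Guided Optimization\n' > src/pgo.md
# register it in the table of contents (exact position is up to the maintainers)
echo '- [Profile-Guided Optimization](./pgo.md)' >> src/SUMMARY.md
# preview locally
mdbook serve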

schmidmt commented 9 months ago

Thanks, @zamazan4ik; we appreciate the contribution.