Open zamazan4ik opened 9 months ago
Hi @zamazan4ik,
Thanks for bringing PGO to our attention as a way to improve performance.
While we have some experience with PGO, most of our experience is with algorithmic improvements to gain performance. Would you like to add a section to our mdbook outlining some of the methods you mentioned? We'd be happy to help with lace and what we've learned about how people use it.
Thanks again, we appreciate it.
Which mdbook do you mean exactly? I think I could contribute PGO-related information to it.
Specifically the code under book in this repo: https://github.com/promised-ai/lace/tree/master/book
Thanks for the link! I think we can do something similar to what has been done for other projects. Some examples:
I need to think about it, and I may be able to create a PR for the book.
Thanks, @zamazan4ik; we appreciate the contribution.
Hi!
Recently I started evaluating Profile-Guided Optimization (PGO) for optimizing different kinds of software; all my current results are available in my GitHub repo. Since PGO improves runtime efficiency in many cases, I decided to run some PGO tests on Lace. I ran some benchmarks and want to share my results here.
Test environment
`master` branch on commit 66e5a67688c76437a9ae5ec1bcadc4c1d0c7b604
Benchmark
For benchmarking purposes, I use two things: the built-in benchmarks and `lace-cli` invocations with manual time measurements.
Built-in benchmarks are invoked with `cargo bench --all-features --workspace`. The PGO instrumentation phase on benchmarks is done with `cargo pgo bench -- --all-features --workspace`. The PGO optimization phase is done with `cargo pgo optimize bench -- --all-features --workspace`.
For `lace-cli`, the Release build is done with `cargo build --release`. The PGO instrumented build is done with `cargo pgo build`. The PGO optimized build is done with `cargo pgo optimize build`. The PGO training phase is done with `LLVM_PROFILE_FILE=/home/zamazan4ik/open_source/lace/cli/target/pgo-profiles/lace_%m_%p.profraw ./lace_instrumented run --csv ../resources/datasets/satellites/data.csv --n-iters 100 result.lace` (see the "Results" section for more details about using different training sets and their impact on the actual performance numbers).
For `lace-cli` I use `taskset -c 0` to reduce the OS scheduler's impact on the results. The `seed` is fixed for the same purpose.
All PGO optimization steps are done with the cargo-pgo tool.
Results
First, here are the results for the built-in benchmarks:
According to these benchmarks, PGO improves performance in many cases. However, as you can see, performance regresses in some cases. This is expected: the benchmarks cover different scenarios, and some scenarios can have "optimization conflicts", where the same optimization decision leads to an improvement in one scenario and a regression in another. That's why using benchmarks for the PGO training phase can be dangerous. Even so, we see many improvements.
To look at a more real-life scenario, I also ran PGO benchmarks on `lace-cli`.
Release vs PGO optimized (trained on the `satellites` dataset) on the `satellites` dataset:
Release vs PGO optimized (trained on the `satellites` dataset) on the `animals` dataset:
Just for reference, here is the slowdown from PGO instrumentation:
I decided to test one more thing: how much does performance differ if we use different PGO training sets? So here we go.
PGO optimized (trained on the `satellites` dataset) vs PGO optimized (trained on the `animals` dataset) on the `animals` dataset:
As you can see, the improvement is measurable (5% is a good improvement).
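A percentage like this is the relative difference between two wall-clock times: (baseline - candidate) / baseline. A small helper for that arithmetic might look like the sketch below. The function name `pct_improvement` is hypothetical, and the inputs are whatever timings you measured yourself.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): percentage by which
# the candidate run is faster than the baseline run.
pct_improvement() {
    # $1 = baseline time, $2 = candidate time (same unit, e.g. seconds)
    awk -v base="$1" -v cand="$2" 'BEGIN {
        if (base <= 0) {
            print "baseline must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", (base - cand) / base * 100
    }'
}
```

For example, `pct_improvement 100 95` prints `5.00`. These inputs are placeholders chosen to illustrate the formula, not measurements from this issue. A negative result means the candidate was slower than the baseline.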
Summing up the results above, I can say that PGO helps Lace achieve better performance.
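For anyone who cares about binary size, I also took some measurements on `lace-cli`:
- Release: 28184240 bytes
- PGO optimized (trained on the `animals` dataset): 28085792 bytes
- PGO optimized (trained on the `satellites` dataset): 27785576 bytes
- PGO instrumented: 116176688 bytes
Possible further steps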
I can suggest the following things to consider:
Testing Post-Link Optimization techniques (like LLVM BOLT) would be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with the usual LTO and PGO.
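As a starting point for the LTO part, Cargo allows profile settings to be overridden through environment variables, so LTO can be tried without editing `Cargo.toml`. The sketch below is an illustration under assumptions: the `run` helper, the `DRY_RUN` switch, and the `lto_release_build` name are hypothetical, and fat LTO is only one of the available LTO modes. By default it only prints the command.

```shell
#!/bin/sh
# Sketch: Release build with fat LTO enabled via Cargo's environment
# override for the release profile. Hypothetical names: run,
# lto_release_build, DRY_RUN.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = execute it

# Print a command, and execute it unless DRY_RUN=1.
run() {
    printf '%s\n' "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

lto_release_build() {
    run env CARGO_PROFILE_RELEASE_LTO=fat cargo build --release
}
```

The same variable should also apply to PGO builds made with cargo-pgo when they use the release profile, which would make it possible to compare Release, LTO, and LTO plus PGO on the same workload.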
Here are some examples of how PGO optimization is integrated into other projects:
`configure` script
I would be happy to answer all your questions about PGO! Much more material about PGO (actual performance numbers across a lot of other projects, PGO state across an ecosystem, PGO traps, and tricky details) can be found at https://github.com/zamazan4ik/awesome-pgo