maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes

Profile-Guided Optimization (PGO) benchmark results #374

Open · zamazan4ik opened this issue 4 months ago

zamazan4ik commented 4 months ago

Hi!

Yesterday I read a post about Logos (I didn't know about the library before). Since the post claims "ridiculously fast" performance, I decided to try to optimize the library further with PGO, as I have already done for many other applications (all the results are available here). I performed some tests and want to share the results.

Test environment

Benchmark

The built-in benchmarks are invoked with cargo bench --workspace --all-features. The PGO instrumentation phase on the benchmarks is done with cargo pgo bench -- --workspace --all-features, and the PGO optimization phase with cargo pgo optimize bench -- --workspace --all-features.

All PGO steps are performed with the cargo-pgo tool.
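For completeness, the full command sequence looks roughly like this (a sketch assuming cargo-pgo is installed, e.g. via cargo install cargo-pgo, together with rustup's llvm-tools-preview component):

```sh
# Baseline numbers without PGO
cargo bench --workspace --all-features

# PGO instrumentation phase: build instrumented benchmarks and run them
# to collect profiles
cargo pgo bench -- --workspace --all-features

# PGO optimization phase: rebuild the benchmarks using the collected profiles
cargo pgo optimize bench -- --workspace --all-features
```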

Results

I got the following results:

At least in the benchmarks provided by the project, I see measurable performance improvements. I don't know how representative these benchmarks are of real-life performance - here I simply trust the project maintainers.

Possible further steps

I can suggest the following things to consider:

I will be happy to answer all your questions about PGO.

jeertmans commented 4 months ago

Hey @zamazan4ik, thank you for your message and comprehensive analysis!

I am new to PGO, but I guess this only optimizes binaries, not library code? How does it provide any meaningful information to improve the code?

I am asking because Logos is a library, so PGO optimisation will likely be applied by library users, not by us.

zamazan4ik commented 4 months ago

I am new to PGO, but I guess this only optimizes binaries, not library code?

Actually, no - PGO works in the same way for binaries and for library code. You can easily apply PGO when building a library (static or dynamic, it doesn't matter), even if you build the library separately from a binary. For example, see the pydantic-core library and the corresponding PR: https://github.com/pydantic/pydantic-core/pull/741

How does it provide any meaningful information to improve the code?

PGO usually allows the compiler to make much smarter inlining decisions. So, in theory, you can compare two Logos versions (with and without PGO), figure out why the PGO-optimized version is faster, and then use those insights to optimize the library code by hand. In that case, you get the performance boost without needing to integrate PGO into the build pipeline.

However, this approach can be quite difficult in practice, because a lot of code needs to be analyzed. Since Logos is a library and you don't ship any prebuilt binaries here, I can suggest at least adding a note to the documentation about using PGO to improve Logos performance, so that Logos users are aware of one more way to speed up their Logos-based applications.
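To illustrate, such a note could describe a workflow roughly like the one below for a downstream application, using the same cargo-pgo tool as above. This is only a sketch: the binary path and the workload are placeholders, and cargo-pgo prints the exact location of the instrumented binary it produces.

```sh
# Rough sketch for a Logos-based application (paths and names are placeholders).

# 1. Build an instrumented binary with cargo-pgo.
cargo pgo build

# 2. Run the instrumented binary (cargo-pgo reports where it was written) on a
#    representative workload so that profiles are collected, e.g.:
#    ./target/<target-triple>/release/my-lexer-app big-input.txt

# 3. Rebuild the binary with the collected profiles applied.
cargo pgo optimize build
```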

jeertmans commented 4 months ago

OK, I got it, thanks! Generating PGO binaries seems a bit convoluted, but a tutorial might be interesting, especially if you notice improvements on examples like the JSON parser :-)

Actually, your link to pydantic-core's PGO process interested me a lot, but for another project ^^'

jeertmans commented 4 months ago

Labelling this as a good first issue for the handbook.

As discussed above, it would be nice to conduct a small analysis of PGO on the JSON parser example, compare performance, and document the results in the book.

PGO is quite well documented here: https://doc.rust-lang.org/rustc/profile-guided-optimization.html.
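For reference, the raw rustc workflow from that page could look roughly like this when applied to an example; the example name "json", its command-line arguments, and the input file are assumptions, and cargo-pgo automates the same steps:

```sh
# Sketch of the rustc PGO workflow from the page above (names are assumptions).
# llvm-profdata ships with rustup's llvm-tools-preview component or a system LLVM.

# 1. Build the example with instrumentation.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" \
    cargo build --release --example json

# 2. Run the instrumented example on representative input to produce .profraw files
#    (assumes the example accepts an input file as its argument).
./target/release/examples/json sample.json

# 3. Merge the raw profiles into a single profile.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild the example using the merged profile.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" \
    cargo build --release --example json
```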