Open zamazan4ik opened 1 month ago
Thanks for the pointers!
Yes, we probably could benefit from PGO, since the IO paths are often performance critical and (somewhat surprisingly) CPU bound. I am especially wondering if PGO will allow us to keep the hot paths optimized for speed while letting the colder paths be optimized for size. Binary size is important to us because in a paravisor configuration, each VM has its own copy of the binary in memory, and with lots of small VMs this can add up.
We found that optimizing the full binary for size reduced the size by a few megabytes, but it also reduced networking performance in our Azure Boost compatibility scenarios by a significant amount. So we're still optimizing the full binary for speed at the moment. Just optimizing specific crates for size didn't seem to help, for reasons I don't fully understand yet. Maybe PGO can help us split the difference.
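For reference, per-crate size optimization can be expressed with Cargo profile overrides; a minimal sketch (the crate name `cold-utils` is hypothetical) would look like this. One possible reason the per-crate approach didn't help: with LTO or aggressive cross-crate inlining, code from a size-optimized crate can be inlined into speed-optimized callers anyway, diluting the override.

```toml
# Sketch: keep the release profile optimized for speed,
# but override a single (cold) dependency to optimize for size.
# "cold-utils" is a hypothetical crate name.
[profile.release]
opt-level = 3

[profile.release.package.cold-utils]
opt-level = "z"
```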
> I am especially wondering if PGO will allow us to keep the hot paths optimized for speed while letting the colder paths be optimized for size. Binary size is important to us because in a paravisor configuration, each VM has its own copy of the binary in memory, and with lots of small VMs this can add up.
Yep, PGO definitely can help with that! In fact, that's the main benefit of PGO: it optimizes hot paths for speed (e.g. inlining them more aggressively) and cold paths for size (inlining them less).
> We found that optimizing the full binary for size reduced the size by a few megabytes, but it also reduced networking performance in our Azure Boost compatibility scenarios by a significant amount. So we're still optimizing the full binary for speed at the moment. Just optimizing specific crates for size didn't seem to help, for reasons I don't fully understand yet. Maybe PGO can help us split the difference.
I am almost sure that the root cause of this behavior is inlining. When you optimize for size, the compiler tries to inline as little as possible. Yes, that helps with binary size, but each non-inlined function introduces an additional cost at every call site: the call/jump itself, plus a possible I-cache miss if the called function is not already in the I-cache. PGO can help the compiler inline the "right" functions, and Post-Link Optimization (PLO) with tools like LLVM BOLT can further reduce I-cache misses.
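To make the hot/cold split concrete, here is a small Rust sketch of what PGO infers automatically from profiles: a hot path marked for aggressive inlining and a cold path pushed out of line. All function names here are hypothetical, and in real code you would let PGO place these attributes for you rather than hand-annotating.

```rust
// Hot path: inline aggressively so callers pay no call/jump cost.
#[inline(always)]
fn parse_header(buf: &[u8]) -> Option<u8> {
    buf.first().copied()
}

// Cold path: keep it out of line and small; #[cold] also tells the
// compiler to lay it out away from the hot code, helping the I-cache.
#[cold]
#[inline(never)]
fn handle_malformed(buf: &[u8]) -> Option<u8> {
    eprintln!("malformed packet of {} bytes", buf.len());
    None
}

fn process(buf: &[u8]) -> Option<u8> {
    if buf.is_empty() {
        return handle_malformed(buf);
    }
    parse_header(buf)
}

fn main() {
    assert_eq!(process(&[7, 8]), Some(7));
    assert_eq!(process(&[]), None);
}
```

PGO effectively derives these annotations from measured execution counts, which is why it can keep the hot IO paths fast while letting everything else shrink.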
Hi!
A few days ago I found an article about OpenVMM on Reddit - as far as I can see, the project aims to deliver peak performance. Recently I evaluated Profile-Guided Optimization (PGO) improvements across many projects - the results are available in the awesome-pgo repo. Since PGO has helped in many cases, I think it would be a good idea to try applying it to OpenVMM.
I can suggest the following things to do:

- For Rust projects, I suggest starting with cargo-pgo.
- Here you can find various PGO materials: benchmarks for different software, examples of how PGO is already integrated into various projects, PGO support across Rust compilers, and some PGO-related advice.
- After PGO, I suggest evaluating the LLVM BOLT optimizer - it can deliver more aggressive optimizations on top of PGO. However, starting with regular PGO will be easier.
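For concreteness, the typical cargo-pgo workflow (per its README) looks roughly like the transcript below. The binary path and workload are placeholders - the workload should be whatever exercises your hot IO paths.

```
# one-time setup
cargo install cargo-pgo

# 1. build an instrumented binary
cargo pgo build

# 2. run a representative workload to collect profiles
#    (binary path/name below are placeholders)
./target/<target-triple>/release/<binary> <your-workload>

# 3. rebuild using the collected profiles
cargo pgo optimize
```

cargo-pgo also has `cargo pgo bolt` subcommands for layering BOLT on top of a PGO-optimized build, which would be the natural second step.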
I would be happy to answer all your questions about PGO!
P.S. It's just an improvement idea for the project. I created the Issue since Discussions are disabled for the repository.