Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT

zamazan4ik commented 1 year ago

Hi!

Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects (including many compilers and compiler-like workloads like static analyzers, code formatters, etc.) - the results are available here. Since oxc is a performance-oriented project, I think PGO can help here too.

We need to evaluate PGO applicability to oxc tooling. And if it helps to achieve better performance - add a note to the documentation about that. In this case, users and maintainers will be aware of another optimization opportunity for oxc. Also, PGO integration into the build scripts can help users and maintainers easily apply PGO for their own workloads. Even distributed by Oxc binaries can be pre-optimized with PGO on a generic-enough sample workload (e.g. rustc already does it).

After PGO, I can suggest evaluating LLVM BOLT as an additional optimization step after PGO.

For the Rust projects, I recommend starting with cargo-pgo - it makes easier PGO optimization in many cases.

Boshen commented 1 year ago

Thank you for the great work around PGO and the instructions.

I verified that this project builds with PGO by following the instructions from cargo-pgo. I'm going to look at setting up the test data when I find the time.

One quick question: Is the test data good enough if I plan to run my current test suite (which has lots of conformance tests), and then on some real projects.

zamazan4ik commented 1 year ago

One quick question: Is the test data good enough if I plan to run my current test suite (which has lots of conformance tests), and then on some real projects.

That's a good question. Usually, the test data (mostly unit-tests, I guess) is not a good candidate for collecting PGO profiles since the tests try to cover all cases (including rare corner cases), but for PGO you are interested in optimizing a "happy" path of the program.

From my experience, the good candidates for collecting PGO profiles are the following workloads:

Real-life workload. Like compile Oxc tooling in instrumentation mode, try to use it on a daily basis, and then recompile Oxc with collected profiles. In this case, the profiles will represent the happy path. This way is recommended and used by multiple projects like Clang, Rustc, etc.
Benchmarks. Usually, (good) benchmarks are written for performance-critical stuff. So with running benchmarks, you get profiles for the performance-critical path of your software.

Boshen commented 1 year ago

Close as not planned for now.

I'm satisfied with the current performance of oxc, and I really don't know how to set this thing up easily 😞

valeneiko commented 5 months ago

@Boshen I just played with PGO locally, and got about 10% perf improvement across all 6 repos I tested.

Both PGO and perf comparison was done using these 6 repositories:

https://github.com/DefinitelyTyped/DefinitelyTyped

> oxlint --threads=12 --quiet -D all
# before (best of 3): Finished in 2.0s on 34897 files with 220 rules using 12 threads.
# after (worst of 3): Finished in 1.8s on 34897 files with 220 rules using 12 threads.

https://github.com/toeverything/affine

> oxlint --threads=12 --deny-warnings -c oxlint.json --import-plugin -D correctness -D perf
# before (best of 3): Finished in 136ms on 1525 files with 96 rules using 12 threads.
# after (worst of 3): Finished in 125ms on 1525 files with 96 rules using 12 threads.

https://github.com/napi-rs/napi-rs

> oxlint --threads=12 --deny-warnings --ignore-path=.oxlintignore --import-plugin -D correctness -A no-export
# before (best of 3): Finished in 18ms on 144 files with 93 rules using 12 threads.
# after (worst of 3): Finished in 17ms on 144 files with 93 rules using 12 threads.

https://github.com/preactjs/preact

> oxlint --threads=12 --deny-warnings -c oxlint.json oxlint src test debug compat hooks test-utils
# before (best of 3): Finished in 18ms on 159 files with 77 rules using 12 threads.
# after (worst of 3): Finished in 16ms on 169 files with 77 rules using 12 threads.

https://github.com/rolldown/rolldown

> oxlint --threads=12 --deny-warnings --ignore-path=.oxlintignore --import-plugin
# before (best of 3): Finished in 14ms on 124 files with 93 rules using 12 threads.
# after (worst of 3): Finished in 13ms on 124 files with 93 rules using 12 threads.

https://github.com/microsoft/vscode

> oxlint --threads=12 --quiet -D all
# before (best of 3): Finished in 919ms on 4856 files with 220 rules using 12 threads.
# after (worst of 3): Finished in 887ms on 4856 files with 220 rules using 12 threads.

And here is what I added to justfile to do it:

ecosystem_dir := "C:/source/ecosystem"
oxlint_bin := "C:/source/rust/oxc/target/release/oxlint.exe"
threads := "12"

pgo_data_dir := "C:/source/rust/oxc/pgo-data"
llvm_profdata_bin := "~/.rustup/toolchains/1.78.0-x86_64-pc-windows-msvc/lib/rustlib/x86_64-pc-windows-msvc/bin/llvm-profdata.exe"

build-pgo:
  just build-pgo-init
  just oxlint_bin=C:/source/rust/oxc/target/x86_64-pc-windows-msvc/release/oxlint.exe ecosystem
  {{llvm_profdata_bin}} merge -o {{pgo_data_dir}}/merged.profdata {{pgo_data_dir}}
  just build-pgo-final

build-pgo-init $RUSTFLAGS="-Cprofile-generate=C:/source/rust/oxc/pgo-data":
  cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

build-pgo-final $RUSTFLAGS="-Cprofile-use=C:/source/rust/oxc/pgo-data/merged.profdata -Cllvm-args=-pgo-warn-missing-function":
  cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

ecosystem:
  -cd "{{ecosystem_dir}}/DefinitelyTyped" && {{oxlint_bin}} --threads={{threads}} --quiet -D all
  cd "{{ecosystem_dir}}/affine" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json --import-plugin -D correctness -D perf
  cd "{{ecosystem_dir}}/napi-rs" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin -D correctness -A no-export
  cd "{{ecosystem_dir}}/preact" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json oxlint src test debug compat hooks test-utils
  cd "{{ecosystem_dir}}/rolldown" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin
  cd "{{ecosystem_dir}}/vscode" && {{oxlint_bin}} --threads={{threads}} --quiet -D all

Paths are specific to my environment, but should give you a good idea how to do it. You need to have all repos cloned into the ecosystem_dir, and have llvm component installed:

rustup component add llvm-tools-preview

Rust toolchain and target is also hard-coded to Windows in my example, so would need to be updated.

Just a heads up, running ecosystem with instrumented build takes a few minutes. So this will basically 3x the compile time.

overlookmotel commented 5 months ago

Given @valeneiko's impressive speed-up findings, I think this is worth considering again.

It may not be viable to integrate with our CI setup, or the compile time increase may be a blocker, but re-opening this issue so we can at least consider it.

@valeneiko Can I ask a favour? Would you be able to run the same kind of test on the parser, and see what (if any) speed-up it gets?

valeneiko commented 5 months ago

@overlookmotel if you can share the command to run the parser. I can do it tomorrow.

Boshen commented 5 months ago

This is really amazing work! I can see some good potential if people are building a service for really high intense work.

As for oxlint, I'm unsure about adding this costly final release build step for the 10% performance improvement.

valeneiko commented 5 months ago

Agree that this build step is too much for regular CI. But for published releases that don't happen that often, might still be worth it.

overlookmotel commented 5 months ago

By the way, only reason I've not come back on your request for the command to run the parser is that there isn't one!

The parser is only exposed as a Rust crate (and an NPM package, but we shouldn't use that as it's slow due to cost of serializing the AST to pass it from Rust to JS).

So I'll need to build one for you! If you don't want to wait for me, you know Rust, and are willing, you could probably knock one up yourself pretty quickly. But please feel free to say "no, I don't have time for that". Very much appreciate you testing this out and putting it on our radar, and am very willing to do what I can to assist you in testing it further. Just am tied up right now so will take me a few days to get to it.

Agree that this build step is too much for regular CI. But for published releases that don't happen that often, might still be worth it.

I tend to agree with you on this.

valeneiko commented 5 months ago

@overlookmotel the results are below. Between 0% and 20% faster. First number is wall time. Second one is cummulative time in just the parse function.

https://github.com/DefinitelyTyped/DefinitelyTyped"

> oxparse --threads=12
# before (best of 3): Finished in 1.1s (1.1s) on 34897 files using 12 threads.
# after (worst of 3): Finished in 1.1s (880ms) on 34897 files using 12 threads.

https://github.com/toeverything/affine"

> oxparse --threads=12
# before (best of 3): Finished in 52ms (45ms) on 1525 files using 12 threads.
# after (worst of 3): Finished in 52ms (40ms) on 1525 files using 12 threads.

https://github.com/napi-rs/napi-rs"

> oxparse --threads=12 --ignore-path=.oxlintignore
# before (best of 3): Finished in 9ms (7ms) on 144 files using 12 threads.
# after (worst of 3): Finished in 9ms (7ms) on 144 files using 12 threads.

https://github.com/preactjs/preact"

> oxparse --threads=12 oxlint src test debug compat hooks test-utils
# before (best of 3): Finished in 7ms (17ms) on 169 files using 12 threads.
# after (worst of 3): Finished in 7ms (16ms) on 169 files using 12 threads.

https://github.com/rolldown/rolldown"

> oxparse --threads=12 --ignore-path=.oxlintignore
# before (best of 3): Finished in 9ms (4ms) on 121 files using 12 threads.
# after (worst of 3): Finished in 9ms (4ms) on 121 files using 12 threads.

https://github.com/microsoft/vscode"

> oxparse --threads=12
# before (best of 3): Finished in 203ms (467ms) on 4856 files using 12 threads.
# after (worst of 3): Finished in 196ms (376ms) on 4856 files using 12 threads.

You can find the source here:

https://github.com/valeneiko/oxc/pull/1

overlookmotel commented 5 months ago

@valeneiko Amazing! Thanks loads for doing this.

I suspect the ones which show 0% improvement just don't run long enough for the speed-up to show with a millisecond measurement granularity.

zamazan4ik commented 5 months ago

I suspect the ones which show 0% improvement just don't run long enough for the speed-up to show with a millisecond measurement granularity.

I can suggest you run such benchmarks with hyperfine - it will allow you to get the results with the required granularity.

overlookmotel commented 5 months ago

The reason I was interested in the parser is that it is absolutely stuffed full of branching, so there's a lot of room there for incorrect branch prediction to incur costs. I am guessing that a lot of the 10% speed boost that PGO gives the linter comes from PGO reducing branch mis-prediction in the parser (or re-ordering branches so that the commonly taken path is the default). The results above seem to at least partially confirm that hypothesis.

The tricky thing is that the parser is provided as a library, not a binary. So, if I've understood correctly, for external consumers it'd be on them to implement PGO - it's not something we can do this end in a library. Have I understood that right?

What we could do in the parser is figure out what changes PGO is making to the parser's codegen, and try to replicate the largest gains by manually guiding the non-PGO compiler to do the same thing with #[cold] / #[inline(never)] hints.

Is there any way to get a picture of what PGO is doing to the parser, in a format which is feasible to interpret?

2nd question: Is there any chance we're overfitting the data, if the files we're "training" PGO on are the same files that we're then measuring the gain of using PGO on?

valeneiko commented 5 months ago

If we are publishing a pre-built library, we can still PGO optimize it. We just need something to dynamically link to it (instead of the usual static linking).

But yes, if people are building the lib from source, they would need to do PGO optimisation on their side.

zamazan4ik commented 5 months ago

Is there any way to get a picture of what PGO is doing to the parser, in a format which is feasible to interpret?

It's possible to extract some statistics about the most frequently-executed functions with llvm-profdata tool (docs) e.g. via the show command with the --topn switch. From this information, you can guess about function hotness and perform manual inlining and hot/cold things (if you want).

However, if you want to get more insights about the performed optimizations, you need to use a disassembler and take a look at the generated assembly. Then you can try to figure out the difference between PGOed and non-PGOed versions. This way will take more time to implement, I suppose.

valeneiko commented 5 months ago

@overlookmotel I just discovered -C remark=... flag in rust compiler that tells LLVM to print diagnostics for any applied / missed optimizations.

When compiling with PGO we can also tell LLVM to print out the stats for branch probalities by adding these flags to $RUSTFLAGS of the final build:

-Cremark=pgo-instrumentation -Cremark=pgo-icall-prom -Cremark=pgo-memop-opt -Cremark=pgo-force-function-attrs -Cllvm-args=--pgo-emit-branch-prob

You can find a list of options to pass to -Cremark= here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassRegistry.def (passing all is also an option, but that produces a lot of data that is difficult to make sense of).

There is also an option that prints basic block frequency: -Cllvm-args=--pgo-view-counts=text.

# To discover these flags I have used:

rustc -Cllvm-args="--help-list-hidden"

# This whole idea was inspired by this lecture:

https://www.youtube.com/watch?v=ulJm7_aTiQM

oxc-project / oxc

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #812