Closed zamazan4ik closed 2 months ago
Thank you for the great work around PGO and the instructions.
I verified that this project builds with PGO by following the instructions from cargo-pgo. I'm going to look at setting up the test data when I find the time.
One quick question: is the test data good enough if I plan to run my current test suite (which has lots of conformance tests), and then run on some real projects?
That's a good question. Usually, the test data (mostly unit-tests, I guess) is not a good candidate for collecting PGO profiles since the tests try to cover all cases (including rare corner cases), but for PGO you are interested in optimizing a "happy" path of the program.
From my experience, the good candidates for collecting PGO profiles are the following workloads:
Close as not planned for now.
I'm satisfied with the current performance of oxc, and I really don't know how to set this thing up easily 😞
@Boshen I just played with PGO locally, and got about 10% perf improvement across all 6 repos I tested.
Both PGO profile collection and the perf comparison were done using these 6 repositories:
> oxlint --threads=12 --quiet -D all
# before (best of 3): Finished in 2.0s on 34897 files with 220 rules using 12 threads.
# after (worst of 3): Finished in 1.8s on 34897 files with 220 rules using 12 threads.
> oxlint --threads=12 --deny-warnings -c oxlint.json --import-plugin -D correctness -D perf
# before (best of 3): Finished in 136ms on 1525 files with 96 rules using 12 threads.
# after (worst of 3): Finished in 125ms on 1525 files with 96 rules using 12 threads.
> oxlint --threads=12 --deny-warnings --ignore-path=.oxlintignore --import-plugin -D correctness -A no-export
# before (best of 3): Finished in 18ms on 144 files with 93 rules using 12 threads.
# after (worst of 3): Finished in 17ms on 144 files with 93 rules using 12 threads.
> oxlint --threads=12 --deny-warnings -c oxlint.json oxlint src test debug compat hooks test-utils
# before (best of 3): Finished in 18ms on 159 files with 77 rules using 12 threads.
# after (worst of 3): Finished in 16ms on 169 files with 77 rules using 12 threads.
> oxlint --threads=12 --deny-warnings --ignore-path=.oxlintignore --import-plugin
# before (best of 3): Finished in 14ms on 124 files with 93 rules using 12 threads.
# after (worst of 3): Finished in 13ms on 124 files with 93 rules using 12 threads.
> oxlint --threads=12 --quiet -D all
# before (best of 3): Finished in 919ms on 4856 files with 220 rules using 12 threads.
# after (worst of 3): Finished in 887ms on 4856 files with 220 rules using 12 threads.
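For reference, the relative improvement implied by these timings can be worked out with a few lines of Rust. This is only a sketch: `improvement_percent` is a hypothetical helper, and the millisecond values are copied from the six runs above (best-of-3 before, worst-of-3 after, so the comparison is deliberately conservative).

```rust
/// Relative wall-time improvement in percent: (before - after) / before * 100.
fn improvement_percent(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

fn main() {
    // (before, after) wall times in milliseconds, in the same order as the
    // oxlint runs quoted above.
    let runs: [(f64, f64); 6] = [
        (2000.0, 1800.0),
        (136.0, 125.0),
        (18.0, 17.0),
        (18.0, 16.0),
        (14.0, 13.0),
        (919.0, 887.0),
    ];
    for (i, (before, after)) in runs.iter().enumerate() {
        println!(
            "run {}: {:.1}% faster",
            i + 1,
            improvement_percent(*before, *after)
        );
    }
}
```

The shortest runs are reported in whole milliseconds, so a one-millisecond rounding difference moves their percentage by several points.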
And here is what I added to the justfile to do it:
ecosystem_dir := "C:/source/ecosystem"
oxlint_bin := "C:/source/rust/oxc/target/release/oxlint.exe"
threads := "12"
pgo_data_dir := "C:/source/rust/oxc/pgo-data"
llvm_profdata_bin := "~/.rustup/toolchains/1.78.0-x86_64-pc-windows-msvc/lib/rustlib/x86_64-pc-windows-msvc/bin/llvm-profdata.exe"
build-pgo:
    just build-pgo-init
    just oxlint_bin=C:/source/rust/oxc/target/x86_64-pc-windows-msvc/release/oxlint.exe ecosystem
    {{llvm_profdata_bin}} merge -o {{pgo_data_dir}}/merged.profdata {{pgo_data_dir}}
    just build-pgo-final

build-pgo-init $RUSTFLAGS="-Cprofile-generate=C:/source/rust/oxc/pgo-data":
    cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

build-pgo-final $RUSTFLAGS="-Cprofile-use=C:/source/rust/oxc/pgo-data/merged.profdata -Cllvm-args=-pgo-warn-missing-function":
    cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

ecosystem:
    -cd "{{ecosystem_dir}}/DefinitelyTyped" && {{oxlint_bin}} --threads={{threads}} --quiet -D all
    cd "{{ecosystem_dir}}/affine" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json --import-plugin -D correctness -D perf
    cd "{{ecosystem_dir}}/napi-rs" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin -D correctness -A no-export
    cd "{{ecosystem_dir}}/preact" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json oxlint src test debug compat hooks test-utils
    cd "{{ecosystem_dir}}/rolldown" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin
    cd "{{ecosystem_dir}}/vscode" && {{oxlint_bin}} --threads={{threads}} --quiet -D all
Paths are specific to my environment, but should give you a good idea of how to do it. You need to have all repos cloned into the ecosystem_dir, and have the llvm-tools component installed:
rustup component add llvm-tools-preview
The Rust toolchain and target are also hard-coded to Windows in my example, so they would need to be updated.
Just a heads up, running ecosystem with the instrumented build takes a few minutes. So this will basically 3x the compile time.
Given @valeneiko's impressive speed-up findings, I think this is worth considering again.
It may not be viable to integrate with our CI setup, or the compile time increase may be a blocker, but re-opening this issue so we can at least consider it.
@valeneiko Can I ask a favour? Would you be able to run the same kind of test on the parser, and see what (if any) speed-up it gets?
@overlookmotel if you can share the command to run the parser. I can do it tomorrow.
This is really amazing work! I can see some good potential if people are building a service for really high-intensity workloads.
As for oxlint, I'm unsure about adding this costly final release build step for the 10% performance improvement.
Agree that this build step is too much for regular CI. But for published releases that don't happen that often, might still be worth it.
By the way, only reason I've not come back on your request for the command to run the parser is that there isn't one!
The parser is only exposed as a Rust crate (and an NPM package, but we shouldn't use that as it's slow due to cost of serializing the AST to pass it from Rust to JS).
So I'll need to build one for you! If you don't want to wait for me, you know Rust, and are willing, you could probably knock one up yourself pretty quickly. But please feel free to say "no, I don't have time for that". Very much appreciate you testing this out and putting it on our radar, and am very willing to do what I can to assist you in testing it further. Just am tied up right now so will take me a few days to get to it.
Agree that this build step is too much for regular CI. But for published releases that don't happen that often, might still be worth it.
I tend to agree with you on this.
@overlookmotel the results are below. Between 0% and 20% faster. The first number is wall time. The second one is cumulative time in just the parse function.
> oxparse --threads=12
# before (best of 3): Finished in 1.1s (1.1s) on 34897 files using 12 threads.
# after (worst of 3): Finished in 1.1s (880ms) on 34897 files using 12 threads.
> oxparse --threads=12
# before (best of 3): Finished in 52ms (45ms) on 1525 files using 12 threads.
# after (worst of 3): Finished in 52ms (40ms) on 1525 files using 12 threads.
> oxparse --threads=12 --ignore-path=.oxlintignore
# before (best of 3): Finished in 9ms (7ms) on 144 files using 12 threads.
# after (worst of 3): Finished in 9ms (7ms) on 144 files using 12 threads.
> oxparse --threads=12 oxlint src test debug compat hooks test-utils
# before (best of 3): Finished in 7ms (17ms) on 169 files using 12 threads.
# after (worst of 3): Finished in 7ms (16ms) on 169 files using 12 threads.
> oxparse --threads=12 --ignore-path=.oxlintignore
# before (best of 3): Finished in 9ms (4ms) on 121 files using 12 threads.
# after (worst of 3): Finished in 9ms (4ms) on 121 files using 12 threads.
> oxparse --threads=12
# before (best of 3): Finished in 203ms (467ms) on 4856 files using 12 threads.
# after (worst of 3): Finished in 196ms (376ms) on 4856 files using 12 threads.
You can find the source here:
@valeneiko Amazing! Thanks loads for doing this.
I suspect the ones which show 0% improvement just don't run long enough for the speed-up to show with a millisecond measurement granularity.
I can suggest you run such benchmarks with hyperfine - it will allow you to get the results with the required granularity.
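Alongside hyperfine, the timing printed by the tool itself could use a finer unit. The following is a minimal sketch with `std::time::Instant`, not the actual oxparse code; `best_of` and the stand-in workload are hypothetical names introduced for illustration.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the shortest elapsed time.
fn best_of<F: FnMut()>(runs: usize, mut work: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        work();
        best = best.min(start.elapsed());
    }
    best
}

fn main() {
    // Stand-in workload (assumption): replace with the real parse call.
    let best = best_of(3, || {
        let sum: u64 = (0..100_000u64).sum();
        std::hint::black_box(sum);
    });
    // Print fractional milliseconds so that two runs which both round to
    // "7ms" can still be told apart.
    println!("best of 3: {:.3}ms", best.as_secs_f64() * 1000.0);
}
```

This only changes the reporting granularity; hyperfine additionally handles warm-up runs and statistics across many repetitions.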
The reason I was interested in the parser is that it is absolutely stuffed full of branching, so there's a lot of room there for incorrect branch prediction to incur costs. I am guessing that a lot of the 10% speed boost that PGO gives the linter comes from PGO reducing branch mis-prediction in the parser (or re-ordering branches so that the commonly taken path is the default). The results above seem to at least partially confirm that hypothesis.
The tricky thing is that the parser is provided as a library, not a binary. So, if I've understood correctly, for external consumers it'd be on them to implement PGO - it's not something we can do this end in a library. Have I understood that right?
What we could do in the parser is figure out what changes PGO is making to the parser's codegen, and try to replicate the largest gains by manually guiding the non-PGO compiler to do the same thing with #[cold] / #[inline(never)] hints.
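As an illustration of what such hints look like, here is a minimal sketch. It is not taken from the oxc parser: `unexpected_byte` and `count_ident_bytes` are hypothetical functions, and whether the hints actually change the generated code would need to be checked in the assembly.

```rust
/// Rare path: build an error message for an unexpected byte.
/// `#[cold]` tells the compiler this function is unlikely to be called;
/// `#[inline(never)]` asks it not to inline the body into callers.
#[cold]
#[inline(never)]
fn unexpected_byte(byte: u8, offset: usize) -> String {
    format!("unexpected byte 0x{byte:02x} at offset {offset}")
}

/// Hypothetical hot loop: count identifier-like ASCII bytes and bail out
/// to the cold function on anything else.
fn count_ident_bytes(source: &[u8]) -> Result<usize, String> {
    let mut count = 0;
    for (offset, &byte) in source.iter().enumerate() {
        if byte.is_ascii_alphanumeric() || byte == b'_' {
            count += 1; // common case
        } else {
            return Err(unexpected_byte(byte, offset)); // rare case
        }
    }
    Ok(count)
}

fn main() {
    println!("{:?}", count_ident_bytes(b"foo_bar1"));
    println!("{:?}", count_ident_bytes(b"foo-bar"));
}
```

The idea is the same one PGO applies automatically from profile data: keep the error path out of line so the common path stays compact.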
Is there any way to get a picture of what PGO is doing to the parser, in a format which is feasible to interpret?
2nd question: Is there any chance we're overfitting the data, if the files we're "training" PGO on are the same files that we're then measuring the gain of using PGO on?
If we are publishing a pre-built library, we can still PGO optimize it. We just need something to dynamically link to it (instead of the usual static linking).
But yes, if people are building the lib from source, they would need to do PGO optimisation on their side.
Is there any way to get a picture of what PGO is doing to the parser, in a format which is feasible to interpret?
It's possible to extract some statistics about the most frequently-executed functions with the llvm-profdata tool (docs), e.g. via the show command with the --topn switch. From this information, you can estimate function hotness and apply manual inlining and hot/cold annotations (if you want).
However, if you want to get more insights about the performed optimizations, you need to use a disassembler and take a look at the generated assembly. Then you can try to figure out the difference between PGOed and non-PGOed versions. This way will take more time to implement, I suppose.
@overlookmotel I just discovered the -C remark=... flag in the Rust compiler, which tells LLVM to print diagnostics for any applied / missed optimizations.
When compiling with PGO we can also tell LLVM to print out the stats for branch probabilities by adding these flags to $RUSTFLAGS of the final build:
-Cremark=pgo-instrumentation -Cremark=pgo-icall-prom -Cremark=pgo-memop-opt -Cremark=pgo-force-function-attrs -Cllvm-args=--pgo-emit-branch-prob
You can find a list of options to pass to -Cremark= here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassRegistry.def (passing all is also an option, but that produces a lot of data that is difficult to make sense of).
There is also an option that prints basic block frequency: -Cllvm-args=--pgo-view-counts=text.
# To discover these flags I have used:
rustc -Cllvm-args="--help-list-hidden"
# This whole idea was inspired by this lecture:
Hi!
Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects (including many compilers and compiler-like workloads like static analyzers, code formatters, etc.) - the results are available here. Since oxc is a performance-oriented project, I think PGO can help here too.
We need to evaluate PGO applicability to oxc tooling. And if it helps to achieve better performance, add a note to the documentation about that. In this case, users and maintainers will be aware of another optimization opportunity for oxc. Also, PGO integration into the build scripts can help users and maintainers easily apply PGO for their own workloads. Even the binaries distributed by Oxc can be pre-optimized with PGO on a generic-enough sample workload (e.g. rustc already does it).
After PGO, I can suggest evaluating LLVM BOLT as an additional optimization step.
For Rust projects, I recommend starting with cargo-pgo - it makes PGO optimization easier in many cases.