Compiler Performance: Benchmark Definitions

michaelwoerister commented 6 years ago

The compiler performance tracking issue (#48547) defines the four main usage scenarios that we strive to support well. In order to measure how well we are actually doing, we define a benchmark for each scenario. Eventually, perf.rust-lang.org will provide a graph for each of scenario that shows how compile times develop over time.

Methodology

Compiler performance in each scenario is measured by the sum of all build times for a given set of projects. The build settings depend on the usage scenario. The set of projects should contain small, medium, and large ones.

Benchmarks

FROM-SCRATCH - Compiling a project from scratch

Compile the listed projects with each of the following combinations:

non-optimized & non-incremental
non-optimized & incremental (w/ empty cache)
optimized (-Ccodegen-units=8, no LTO) & non-incremental
optimized (no LTO) & incremental (w/ empty cache)

Projects:

style-servo
script-servo
encoding-rs
clap-rs
regex
helloworld
crates.io
hyper
html5ever
tokio-webpush-simple
inflate
syn
futures
piston-image
ripgrep
webrender
cargo
winapi
stm32f103xx

SMALL-CHANGE - Re-Compiling a project after a small change

For this scenario, we re-compile the project incrementally with a full cache after a println!() statement has been added somewhere.

non-optimized & incremental (w/ full cache)
- style-servo
- script-servo
- encoding-rs (cargo test --lib --no-run)
- clap-rs (cargo test --no-run)
- regex (cargo test --lib --no-run)
- crates.io
- syn (cargo test --no-run)
- futures (cargo test --test=all --no-run)
- tokio-webpush-simple
- ripgrep
- webrender
optimized (no LTO) & incremental (w/ full cache)
- style-servo
- script-servo
- tokio-webpush-simple
- crates.io
- ripgrep
- webrender
- cargo

RLS - Continuously re-compiling a project for the Rust Language Server

For this scenario, we run cargo check incrementally with a full cache after a println!() statement has been added somewhere.

cargo check, non-optimized & incremental (w/ full cache)

Projects:

style-servo
script-servo
encoding-rs
clap-rs
regex
helloworld
crates.io
hyper
html5ever
tokio-webpush-simple
inflate
syn
futures
piston-image
ripgrep
webrender
cargo

NOTE: This is a rather crude method for measuring RLS performance since there are many more variables that need to be taken into account here. For example, the RLS will invoke the compiler differently, allowing for things to be kept in memory that would go onto the disk otherwise. It also produces "save-analysis" data, which cargo check does not, and the creation of which can take up a significant amount of time and thus should be measured! Consequently, the RLS benchmarks need more discussion.

DIST - Compiling a project for maximum runtime performance

For this scenario, we compile the projects from scratch, with maximum optimizations:

optimized (--opt-level=3, full LTO), non-incremental
optimized (--opt-level=3, whole crate graph ThinLTO), non-incremental

Projects:

style-servo
script-servo
crates.io
tokio-webpush-simple
inflate
ripgrep
webrender
cargo
stm32f103xx

Open Questions

[ ] The sum of all build times might be too crude of a metric. Sometimes a crate does not compile at all with a specific compiler version. Only successful builds should go into the aggregate score. Is there a metric that intrinsically corrects for missing individual scores?
[ ] How to better measure performance in the RLS case?

Please provide your feedback on how well you think the above benchmarks actually measure what people care about when using the Rust compiler. I expect these definitions to undergo a few cycles of iteration before we are satisfied with them.

cc @rust-lang/wg-compiler-performance

michaelwoerister commented 6 years ago

On IRC, @nagisa suggested adding crates with large amounts of generated code to the FROM-SCRATCH benchmark, in particular winapi and svd2rust:

svd2rust is different enough from winapi in that it has actual code and possibly produces actual symbols svd2rust-generated-code that is, not svd2rust itself

I'll add them to FROM-SCRATCH and DIST for now.

nikomatsakis commented 6 years ago

Compiler performance in each scenario is measured by the sum of all build times for a given set of projects

Interesting choice. I would have expected the mean/median/geometric-mean, something like that, but I guess I can see the appeal of the sum. I wonder if we want to make multiple statistics available.

michaelwoerister commented 6 years ago

I think the median would lose too much information. The arithmetic mean is pretty similar to the sum. The geometric mean sounds like an interesting choice. It would give each sub-benchmark exactly the same weight. I'm not sure if that's desirable.

nikomatsakis commented 6 years ago

What do other benchmark suites do?

michaelwoerister commented 6 years ago

SpecInt uses the geometric mean for combining scores. JetStream too. Sunspider and Kraken just use the sum of all execution times (at least on https://arewefastyet.com). Octane uses the geometric mean too.

An additional thought would be to compute the score as seconds / lines of code. That would make the score somewhat more stable under addition or removal of sub-benchmarks. But that's probably an unattainable goal anyway.

nikomatsakis commented 6 years ago

@michaelwoerister

OK, it seems like the question is how much we want the "absolute times" to matter. That is, if our suite includes one thing that takes a long time (say, 1 minute) and a bunch that are very short (sub-second), then effectively improving those won't matter at all. This seems good and bad.

To that end, I wonder if we can just present both. If I had to pick one, though, I'd probably pick geometric mean, since otherwise it seems like smaller tests will just get drowned out and might as well be excluded.

I found this text from Wikipedia helpful to understand geometric mean:

A geometric mean is often used when comparing different items—finding a single "figure of merit" for these items—when each item has multiple properties that have different numeric ranges. For example, the geometric mean can give a meaningful "average" to compare two companies which are each rated at 0 to 5 for their environmental sustainability, and are rated at 0 to 100 for their financial viability. f an arithmetic mean were used instead of a geometric mean, the financial viability is given more weight because its numeric range is larger—so a small percentage change in the financial rating (e.g. going from 80 to 90) makes a much larger difference in the arithmetic mean than a large percentage change in environmental sustainability (e.g. going from 2 to 5). The use of a geometric mean "normalizes" the ranges being averaged, so that no range dominates the weighting, and a given percentage change in any of the properties has the same effect on the geometric mean. So, a 20% change in environmental sustainability from 4 to 4.8 has the same effect on the geometric mean as a 20% change in financial viability from 60 to 72.

scottmcm commented 6 years ago

Geometric mean is a good default for durations (along with geometric standard deviation, for understanding uncertainty in multiple runs of building the same thing).

Arithmetic mean is usually a poor choice for one-sided distributions (like time duration) since the usual μ±2σ so easily goes negative for high-variance measurements, making the usual intuitions less helpful.

On durations specifically, psychology has shown that the perceived midpoint between two durations is at the geometric mean, not the arithmetic mean, in both humans and animals. (I learned this from Designing and Engineering Time (Seow 2008), which references Human bisection at the geometric mean (Allan 1991) and Bisection of temporal intervals (Church and Deluty 1977), though sadly I don't have a free source.)

nnethercote commented 6 years ago

compare.py doesn't try to produce an aggregate score, it just gives speedups per program, viz:

futures-rs-test  5.028s vs  4.433s --> 1.134x faster (variance: 1.020x, 1.030x)
helloworld       0.283s vs  0.235s --> 1.202x faster (variance: 1.012x, 1.025x)
html5ever-2016-  6.293s vs  5.652s --> 1.113x faster (variance: 1.011x, 1.008x)
hyper.0.5.0      6.182s vs  5.039s --> 1.227x faster (variance: 1.002x, 1.018x)
inflate-0.1.0    5.168s vs  4.935s --> 1.047x faster (variance: 1.001x, 1.002x)
issue-32062-equ  0.457s vs  0.347s --> 1.316x faster (variance: 1.010x, 1.007x)
issue-32278-big  2.046s vs  1.706s --> 1.199x faster (variance: 1.003x, 1.007x)
jld-day15-parse  1.793s vs  1.538s --> 1.166x faster (variance: 1.059x, 1.020x)
piston-image-0. 13.871s vs 11.885s --> 1.167x faster (variance: 1.005x, 1.005x)
regex.0.1.30     2.937s vs  2.516s --> 1.167x faster (variance: 1.010x, 1.002x)
rust-encoding-0  2.414s vs  2.078s --> 1.162x faster (variance: 1.006x, 1.005x)
syntex-0.42.2   36.526s vs 32.373s --> 1.128x faster (variance: 1.003x, 1.004x)
syntex-0.42.2-i 21.500s vs 17.916s --> 1.200x faster (variance: 1.007x, 1.013x)

I found that worked well.

michaelwoerister commented 6 years ago

@scottmcm, that's very useful, thanks! @nikomatsakis, let's start with just the geometric mean. It seems to be a good choice overall!

michaelwoerister commented 6 years ago

@nnethercote Yes, that is very useful for gauging the effect of individual optimizations and more detailed investigations. We support a view like this on perf.rlo already (example) and we definitely want to keep and extend that, since it's probably what one is going to use most for day-to-day work.

I still think an additional aggregate score per usage scenario and a dashboard showing it over time is useful for seeing longterm trends.

nnethercote commented 6 years ago

After thinking some more and re-reading the performance section from Hennessy & Patterson, I think the right thing to do for a summary score is the following.

Normalize scores for each benchmark so they have equal weights. E.g. for a series of runs of a particular benchmark, divide those runs's values by the value of first benchmark run. Currently we have drastically unequal weights; script-servo takes longer than all the other benchmarks combined. Giving every benchmark equal weight arguably is somewhat arbitrary, but it's simple, and other potential weightings aren't clearly better, IMO.
Take the geometric mean of these normalized scores to produce the summary score. The good thing about a geometric mean is that it doesn't matter which benchmark run you choose to normalize against, you'll get the same shape graph. (In contrast, the arithmetic mean will give you different-shaped graphs depending on which run you normalize against.)

However, there is a bigger problem that I think needs to be solved first: the benchmark suite is not consistent. New benchmarks get added, old ones get removed, and sometimes benchmarks temporarily fail to compile due to compiler bugs (the compiler needs fixing) or due to the use of unstable code (the benchmarks need fixing).

I tweaked the site's code to simply plot the number of benchmarks that ran successfully. Here's the output for a period of time over May and June this year: summary There is huge variation. Any summary value will be unreliable in the face of this. (Indeed, I completely ignore the summary graphs on perf.rust-lang.org for this reason.)

I think it's critical that we find a way to effectively fill in these gaps. I suggest using interpolation to produce "fake" (but reasonable) values. For example, imagine we have 20 benchmarks, and we run each of them in a particular rustc configuration (e.g. "Check" + "Clean") on 10 different rustc versions. We want 10 different "summary" values, one per rustc version. Imagine also that one of the benchmarks, B, failed to compile on runs 1, 5, 9, and 10, due to bugs in those particular rustc versions. Without interpolation, the summary values for those runs will not match the summary values for all the other runs. Interpolation would work in the following way.

The failure on run 5 occurred in the middle of the series. Use average of the values for runs 4 and 6 for run 5 of benchmark B.
The failure on run 1 occurred at the start of the series. Use the value from run 2 for run 1 of benchmark B.
The failures on runs 9 and 10 occurred at the end of the series. Use the value from run 8 for runs 9 and 10 of benchmark B.

This will fill in most of the gaps, getting us much closer to a complete set of data. It won't help in the case where a benchmark B failed to compile on every run (as happens with some cases with NLL) but it still gets us most of the way there.

@Mark-Simulacrum: I looked at the code yesterday to see how to do this. Unfortunately the data structures used didn't seem all that amenable to working out what the missing data points were and adding them in, but maybe you have some ideas.

rushmorem commented 6 years ago

May I suggest adding the psl crate to your benchmarks? It generates a huge match statement from the Public Suffix List. On machines I have compiled it on, it takes about 20 to 30 minutes to compile :smile:

pnkfelix commented 2 years ago

visiting for backlog bonanza.

It is not clear if there's any information here that isn't already in https://github.com/rust-lang/rustc-perf

@Mark-Simulacrum @rylev @nnethercote I'm just mentioning you in case you happen to see something in this issue's description that seems worth pulling over to the README or other pages for https://github.com/rust-lang/rustc-perf

nnethercote commented 2 years ago

I agree that closing this issue is reasonable. A lot of changes have been made to rustc-perf in the past four years, and this issue is no longer serving a useful purpose.

rust-lang / rust