We need a proposal as to what would be good to collect. Ideally, I think we'd have this statistic track the individual ".so" files we see in the unarchived directory. This would be a somewhat unique statistic since it wouldn't depend on the benchmarks directory, instead only looking at the sizes of the artifacts Rust ships.
I'd appreciate advice on whether we can get more fine-grained details for the size (e.g., debuginfo KB, code size KB, etc.) easily on Linux.
Yeah I was thinking we wouldn't necessarily do this for everything if we couldn't clearly map to an artifact, and we should only do it for "final artifacts" for sure (tracking rlibs wouldn't be too useful). I'd imagine something like:
- size before `strip -g`
- size after `strip -g`
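For illustration, here is a minimal sketch of one way to get the finer-grained breakdown asked about above (code size vs. debuginfo) on Linux, by summing ELF section sizes. It assumes the external `object` crate and a name-based heuristic; it is not part of rustc-perf.

```rust
use object::{Object, ObjectSection};

// Sketch: split an artifact's size into code vs. debuginfo vs. everything
// else by summing ELF section sizes. The name-based classification is a
// rough heuristic, and `object` is an external crate, not rustc-perf code.
fn size_breakdown(path: &str) -> Result<(u64, u64, u64), Box<dyn std::error::Error>> {
    let data = std::fs::read(path)?;
    let file = object::File::parse(&*data)?;
    let (mut code, mut debug, mut other) = (0u64, 0u64, 0u64);
    for section in file.sections() {
        let size = section.size();
        match section.name().unwrap_or("") {
            ".text" => code += size,
            name if name.starts_with(".debug") => debug += size,
            _ => other += size,
        }
    }
    Ok((code, debug, other))
}
```

Comparing the totals before and after `strip -g` would give a similar debuginfo number without any extra dependencies.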
I started on a collection of rust+wasm code size micro benchmarks: https://github.com/fitzgen/code-size-benchmarks
Would be awesome to get this integrated here -- is this the kind of thing you were imagining @alexcrichton ?
@fitzgen looks great! I'd probably throw in non-LTO builds as well for comparison, but otherwise that seems like a great starting point to me. It may even be possible to just set up a daily cron job running on Travis pushing the results somewhere?
Should be feasible to integrate at a basic level; I'll look into it sometime over the next couple of weeks.
Okay, so integration is primarily blocked on being able to install targets (i.e., a `rustup target add` equivalent) for benchmarks. This is something we need to do anyway for winapi/windows benchmarks so I'll take a look at doing so soon.
Can I help somehow? I'd love to see binary size statistics get tracked (and ultimately, optimized 🙂)
There are a couple things that need doing here:

- installing extra targets (e.g. wasm)
- parsing Cargo's `--message-format=json` output to find the artifacts (?)

Order here doesn't matter too much, I think.
For the targets, the first step would be to get the sysroot module capable of installing targets other than the default std. We can't rely on the target existing (due to try builds) so these would have to be optional installations. I don't think this is terribly hard, but it does require some legwork. `rustup --verbose target add` can provide an idea of URLs to use; you'll be requesting from the CI bucket, not static.rust-lang.org, but that shouldn't matter too much in theory.
For Cargo, we'll probably want to try to get a representative artifact for each crate -- likely this'll be a heuristic but I suspect that's the best we can do (so/wasm, rlib, binary). I believe Cargo will dump the output artifacts from rustc if run with `--message-format=json`, so we would want to parse that and load the relevant data, then get the file sizes of the artifacts and dump them as if `perf stat` gave them out, I think.
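For illustration, a minimal sketch of that parsing step, assuming a plain `serde_json` pass over Cargo's per-line JSON messages (this is not the actual rustc-perf code):

```rust
use std::process::Command;

// Sketch: run Cargo with JSON messages and record the size of each emitted
// artifact. Error handling and filtering (e.g. keeping only "final"
// artifacts) are omitted for brevity.
fn artifact_sizes() -> Result<Vec<(String, u64)>, Box<dyn std::error::Error>> {
    let output = Command::new("cargo")
        .args(["build", "--release", "--message-format=json"])
        .output()?;
    let mut sizes = Vec::new();
    for line in String::from_utf8(output.stdout)?.lines() {
        // Cargo emits one JSON object per line; artifacts carry this reason.
        let msg: serde_json::Value = serde_json::from_str(line)?;
        if msg["reason"] == "compiler-artifact" {
            for file in msg["filenames"].as_array().into_iter().flatten() {
                if let Some(path) = file.as_str() {
                    let size = std::fs::metadata(path)?.len();
                    sizes.push((path.to_string(), size));
                }
            }
        }
    }
    Ok(sizes)
}
```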
> This would be a somewhat unique statistic since it wouldn't depend on the benchmarks directory, instead only looking at the sizes of the artifacts Rust ships.
As mentioned in https://github.com/rust-lang/rust/pull/82834#issuecomment-805072594, collecting the sizes of the output binaries of the benchmarks (at least those which produce one) could be useful too, as a cheaper proxy measure for runtime benchmarks: if more time is spent in LLVM, that would hopefully result in smaller binaries. That way we could make better decisions about whether compile-time regressions can be justified by runtime (or footprint) improvements.
I created https://github.com/rust-lang/rustc-perf/pull/1348 as a first step toward collecting binary (artifact) sizes for the existing benchmarks. I would like to eventually generalize `rustc-perf` to also gather runtime performance stats of compiled binaries, but that's a long way off, so gathering the artifact sizes is a first step toward that generalization.
Going further, I think that it would be nice to record the binary size of a few special benchmarks that are interesting for binary size. For starters, it could be just `helloworld`, as a minimal program.
We will probably want to record the size of `helloworld` with various configurations:

1) With `panic = "abort"`.
2) With `strip = true`.
3) With `opt-level = "z"`.
4) With `lto = true`.
5) With `codegen-units = 1`.
6) With `no_std`.
7) With all of the above combined.
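As an illustration only (not what the collector actually does), options 1-5 roughly correspond to rustc codegen flags that could be passed through `RUSTFLAGS`; the function name and benchmark path below are hypothetical:

```rust
use std::process::Command;

// Sketch of building `helloworld` with options 1-5 combined via RUSTFLAGS.
// `no_std` (option 6) is a crate attribute, so it would need a separate
// benchmark variant rather than a flag. The directory path is hypothetical.
fn build_min_size_helloworld() -> std::io::Result<std::process::ExitStatus> {
    Command::new("cargo")
        .args(["build", "--release"])
        .env(
            "RUSTFLAGS",
            "-C panic=abort -C strip=symbols -C opt-level=z -C lto=fat -C codegen-units=1",
        )
        .current_dir("collector/benchmarks/helloworld") // hypothetical path
        .status()
}
```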
For this to work, we need to:
1) Have the option to pass custom `RUSTFLAGS` to a benchmark. This could be done by introducing a new scenario or a profile, but I'm not sure how "scalable" that is. Currently the code uses enums for both scenarios and profiles, so adding so many new scenarios or profiles would cause a lot of churn. Maybe we can add a new enum variant, like `Scenario::Custom { rustflags: String }`? Or we could create an additional column in the DB for custom flags.
2) Have the option to specify which benchmarks are interesting for binary size comparisons (i.e. which benchmarks should run with all of the configurations described above). Otherwise we would benchmark all of the crates with the Cartesian product of the existing scenario/profile configurations and the above options, which is not feasible.
We could just manually `--include` and `--exclude` the set of benchmarks that are interesting for compilation performance and artifact sizes in the collect script, but it seems a bit cumbersome. Maybe we could generalize the benchmark category field? Something like:
```
primary - real-world benchmarks useful for compilation performance comparison
secondary - artificial benchmarks useful for compilation performance comparison
artifact-size - benchmarks useful for artifact size comparison
(in the future) runtime - benchmarks useful for runtime performance comparison
```
CC @Mark-Simulacrum
> ```
> primary - real-world benchmarks useful for compilation performance comparison
> secondary - artificial benchmarks useful for compilation performance comparison
> artifact-size - benchmarks useful for artifact size comparison
> (in the future) runtime - benchmarks useful for runtime performance comparison
> ```
Categories are currently mutually exclusive. Would that still be true with this suggestion?
I agree that these measurements could be useful, and that it's hard to decide what to collect and how to collect them.
The `IncrPatched` scenarios already rely on per-benchmark `.patch` files. Perhaps add some new optional scenarios: `FullPanicAbort`, `FullStripTrue`, `FullOptLevelZ`, `FullLtoTrue`, `FullCodegenUnits1`, `FullNoStd`, `FullAll`? And then use the `profile-config.json` file to opt in to each one? You could generalize this further and allow any benchmark to opt into any set of scenarios, though that might not be useful in practice. I think that would be better than changing categories.
I also think scenarios are a good fit here, pretty much along the same lines that @nnethercote suggests. I'm not sure I buy the argument that enums make things harder here -- I think it's good we need to think about each variant if we're matching on the enum; if we can handle it opaquely, then I would expect us to be able to still do that (e.g., by just passing it along, or rendering it to a String for the UI, etc.).
Scenarios currently have some interactions with each other, e.g. `Full` has to be completed before the `Incr` ones can be started, and `IncrPatched` performs a patching operation during the benchmark. So they don't serve only to change the compilation flags; they also affect how the benchmark is performed. It seems to me that the mentioned compile options are slightly orthogonal to that: I could want to benchmark how `PanicAbort` behaves with a `Full` or an `IncrFull` run. In terms of benchmark interaction, none of the mentioned configurations should change how the benchmark is performed; we just want to say that we want to compile it N times with different compilation options.
We need to do several things with the compilation variants:
1) Store them in the DB.
2) Present them in the UI.
3) Opt in to them for certain benchmarks.
We can use explicit enum variants, like `FullOptZ`, `FullPanicAbort`, etc. Benchmarks could opt in to these either explicitly in `collect.sh`, or via `profile-config.json` (`{"extra-scenarios": ["FullPanicAbort", "FullStripTrue"]}`).
Advantages: easy to store in the DB and present in the UI; explicitness in code.
Disadvantages: it won't be possible to easily choose between `FullPanicAbort` and `IncrFullPanicAbort`.
Questions: I'm not sure about the CLI. Would we allow users to select these variants with `--scenarios`? And what about `--scenarios all`: should it select these (somewhat secondary) variants?
The more I think about this, the more it seems to me that compile-time flags like LTO, panic=abort, CGU=1, etc., which do not affect the benchmarking process itself (unlike incremental compilation and patching), should be orthogonal to the other options. What about creating a new DB column and an enum like `Flags::OptZ`, `Flags::PanicAbort`?
This would make the flags explicit in code, be easy to store in the DB and show in the UI, be backwards compatible (we can just leave the column empty if there are no non-default flags to pass), be easy to opt in to from both the CLI and JSON, and it would be possible to combine them with debug/opt and incr/full.
Of course debug/opt is also a compile-time flag in a way, but it's special enough (it isn't normally passed through `RUSTFLAGS` or `Cargo.toml` options) that I wouldn't mind keeping it in the `Profile` category, as it is now.
As for scenarios, presumably just `Full` is enough?
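For illustration, a rough sketch of what such an orthogonal axis could look like; the type name `CodegenFlags`, its variants, and the chosen flag spellings are all hypothetical, not actual rustc-perf types:

```rust
// Hypothetical extra axis, orthogonal to Profile and Scenario.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CodegenFlags {
    Default,
    PanicAbort,
    Strip,
    OptLevelZ,
    Lto,
    CodegenUnits1,
}

impl CodegenFlags {
    /// RUSTFLAGS fragment to append for this configuration (illustrative).
    fn rustflags(self) -> &'static str {
        match self {
            CodegenFlags::Default => "",
            CodegenFlags::PanicAbort => "-C panic=abort",
            CodegenFlags::Strip => "-C strip=symbols",
            CodegenFlags::OptLevelZ => "-C opt-level=z",
            CodegenFlags::Lto => "-C lto=fat",
            CodegenFlags::CodegenUnits1 => "-C codegen-units=1",
        }
    }
}
```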
Incremental compilation artifact sizes are probably also quite interesting.
> The more I'm thinking about this, the more it seems to me that compile-time flags like LTO, panic=abort, CGU=1 etc. which do not affect the benchmark process itself (unlike incr. and patching)
I'm having trouble following why LTO is any different from incremental in terms of affecting compilation - both of these are going to increase compile times (or not), have a different memory profile, produce a different binary output...
I agree in some sense that if we end up with a complete cross matrix of the current scenarios and the new options, storing them separately could make sense. But it also seems to me that for ease of implementation, we can do so purely at the code level (perhaps only in the collector) and serialize into the database as just a scenario. I'm not convinced we have a good semantic for the extra flags column that pushes it to be distinct from scenarios.
> I'm having trouble following why LTO is any different from incremental in terms of affecting compilation - both of these are going to increase compile times (or not), have a different memory profile, produce a different binary output...
The difference is that incremental + patching require changes to the code that runs the benchmark, while LTO/PanicAbort/etc. don't; we just use different flags. But it's not that important.
> I agree in some sense that if we end up with a complete cross matrix of the current scenarios and the new options, storing them separately could make sense. But it also seems to me that for ease of implementation, we can do so purely at the code level (perhaps only in the collector) and serialize into the database as just a scenario. I'm not convinced we have a good semantic for the extra flags column that pushes it to be distinct from scenarios.
I agree! Storing compilation flags in the DB/UI as "just another scenario string", but treating them as a different axis that can be set at the collector level, sounds like a good compromise.
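Continuing the hypothetical sketch above, that compromise could look roughly like keeping the flags as their own field in the collector and flattening them into the stored scenario label only at the database boundary:

```rust
// Hypothetical: flatten the separate flags axis into the stored scenario label.
fn stored_scenario_label(scenario: &str, flags: Option<&str>) -> String {
    match flags {
        Some(f) if !f.is_empty() => format!("{scenario}-{f}"),
        _ => scenario.to_string(),
    }
}
```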
This has now been implemented in https://github.com/rust-lang/rustc-perf/pull/1348, as a first step.
We now record both the size of generated artifacts and the binary size of the compiler (and its associated libraries) itself. And we can also distinguish between library/binary artifacts, to get more context about the sizes. So I think that this can be closed (more specialized issues can be created for concrete use-cases).
That would perhaps be an interesting metric!