We need a proposal as to what would be good to collect. Ideally, I think we'd have this statistic track the individual ".so" files we see in the unarchived directory. This would be a somewhat unique statistic since it wouldn't depend on the benchmarks directory, instead only looking at the sizes of the artifacts Rust ships.
I'd appreciate advice on whether we can get more fine-grained details for the size (e.g., debuginfo KB, code size KB, etc.) easily on Linux.
Yeah I was thinking we wouldn't necessarily do this for everything if we couldn't clearly map to an artifact, and we should only do it for "final artifacts" for sure (tracking rlibs wouldn't be too useful). I'd imagine something like:
- size before `strip -g`
- size after `strip -g`
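For illustration, here is a minimal sketch of one way to get the finer-grained breakdown asked about above (code size vs. debuginfo) on Linux, by summing ELF section sizes. It assumes the external `object` crate and a name-based heuristic; it is not part of rustc-perf.

```rust
use object::{Object, ObjectSection};

// Sketch: split an artifact's size into code vs. debuginfo vs. everything
// else by summing ELF section sizes. The name-based classification is a
// rough heuristic, and `object` is an external crate, not rustc-perf code.
fn size_breakdown(path: &str) -> Result<(u64, u64, u64), Box<dyn std::error::Error>> {
    let data = std::fs::read(path)?;
    let file = object::File::parse(&*data)?;
    let (mut code, mut debug, mut other) = (0u64, 0u64, 0u64);
    for section in file.sections() {
        let size = section.size();
        match section.name().unwrap_or("") {
            ".text" => code += size,
            name if name.starts_with(".debug") => debug += size,
            _ => other += size,
        }
    }
    Ok((code, debug, other))
}
```

Comparing the totals before and after `strip -g` would give a similar debuginfo number without any extra dependencies.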
I started on a collection of rust+wasm code size micro benchmarks: https://github.com/fitzgen/code-size-benchmarks
Would be awesome to get this integrated here -- is this the kind of thing you were imagining @alexcrichton ?
@fitzgen looks great! I'd probably throw in non-LTO builds as well for comparison, but otherwise that seems like a great starting point to me. It may even be possible to just set up a daily cron job running on Travis pushing the results somewhere?
Should be feasible to integrate at a basic level; I'll look into it sometime over the next couple of weeks.
Okay, so integration is primarily blocked on being able to install targets (i.e., a `rustup target add` equivalent) for benchmarks. This is something we need to do anyway for winapi/windows benchmarks so I'll take a look at doing so soon.
Can I help somehow? I'd love to see binary size statistics get tracked (and ultimately, optimized 🙂)
There are a couple things that need doing here:

- installing extra targets (e.g. wasm)
- parsing Cargo's `--message-format=json` output to find the artifacts (?)

Order here doesn't matter too much, I think.
For the targets, the first step would be to get the sysroot module capable of installing targets other than the default std. We can't rely on the target existing (due to try builds) so these would have to be optional installations. I don't think this is terribly hard, but it does require some legwork. `rustup --verbose target add` can provide an idea of URLs to use; you'll be requesting from the CI bucket, not static.rust-lang.org, but that shouldn't matter too much in theory.
For Cargo, we'll probably want to try to get a representative artifact for each crate -- likely this'll be a heuristic but I suspect that's the best we can do (so/wasm, rlib, binary). I believe Cargo will dump the output artifacts from rustc if run with `--message-format=json`, so we would want to parse that and load the relevant data, then get the file sizes of the artifacts and dump them as if `perf stat` gave them out, I think.
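For illustration, a minimal sketch of that parsing step, assuming a plain `serde_json` pass over Cargo's per-line JSON messages (this is not the actual rustc-perf code):

```rust
use std::process::Command;

// Sketch: run Cargo with JSON messages and record the size of each emitted
// artifact. Error handling and filtering (e.g. keeping only "final"
// artifacts) are omitted for brevity.
fn artifact_sizes() -> Result<Vec<(String, u64)>, Box<dyn std::error::Error>> {
    let output = Command::new("cargo")
        .args(["build", "--release", "--message-format=json"])
        .output()?;
    let mut sizes = Vec::new();
    for line in String::from_utf8(output.stdout)?.lines() {
        // Cargo emits one JSON object per line; artifacts carry this reason.
        let msg: serde_json::Value = serde_json::from_str(line)?;
        if msg["reason"] == "compiler-artifact" {
            for file in msg["filenames"].as_array().into_iter().flatten() {
                if let Some(path) = file.as_str() {
                    let size = std::fs::metadata(path)?.len();
                    sizes.push((path.to_string(), size));
                }
            }
        }
    }
    Ok(sizes)
}
```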
> This would be a somewhat unique statistic since it wouldn't depend on the benchmarks directory, instead only looking at the sizes of the artifacts Rust ships.
As mentioned in https://github.com/rust-lang/rust/pull/82834#issuecomment-805072594, collecting the sizes of the output binaries of the benchmarks (at least those which produce one) could be useful too, as a cheaper proxy measure for runtime benchmarks: if more time is spent in LLVM, that would hopefully result in smaller binaries. That way we could make better decisions about whether compile-time regressions can be justified by runtime (or footprint) improvements.
I created https://github.com/rust-lang/rustc-perf/pull/1348 as a first step toward collecting binary (artifact) sizes for the existing benchmarks. I would like to eventually generalize `rustc-perf` to also gather runtime performance stats of compiled binaries, but that's a long way off, so gathering the artifact sizes is a first step toward that generalization.
Going further, I think that it would be nice to record the binary size of a few special benchmarks that are interesting for binary size. For starters, it could be just `helloworld`, as a minimal program.
We will probably want to record the size of `helloworld` with various configurations:

1) With `panic = "abort"`.
2) With `strip = true`.
3) With `opt-level = "z"`.
4) With `lto = true`.
5) With `codegen-units = 1`.
6) With `no_std`.
7) With all of the above combined.
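As an illustration only (not what the collector actually does), options 1-5 roughly correspond to rustc codegen flags that could be passed through `RUSTFLAGS`; the function name and benchmark path below are hypothetical:

```rust
use std::process::Command;

// Sketch of building `helloworld` with options 1-5 combined via RUSTFLAGS.
// `no_std` (option 6) is a crate attribute, so it would need a separate
// benchmark variant rather than a flag. The directory path is hypothetical.
fn build_min_size_helloworld() -> std::io::Result<std::process::ExitStatus> {
    Command::new("cargo")
        .args(["build", "--release"])
        .env(
            "RUSTFLAGS",
            "-C panic=abort -C strip=symbols -C opt-level=z -C lto=fat -C codegen-units=1",
        )
        .current_dir("collector/benchmarks/helloworld") // hypothetical path
        .status()
}
```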
For this to work, we need to:
1) Have the option to pass custom `RUSTFLAGS` to a benchmark. This could be done by introducing a new scenario or a profile, but I'm not sure how "scalable" that is. Currently the code uses enums for both scenarios and profiles, so adding so many new scenarios or profiles would cause a lot of churn. Maybe we can add a new enum variant, like `Scenario::Custom { rustflags: String }`? Or we could create an additional column in the DB for custom flags.
2) Have the option to specify which benchmarks are interesting for binary size comparisons (i.e. which benchmarks should run with all of the configurations described above). Otherwise we would benchmark all of the crates with the Cartesian product of the existing scenario/profile configurations and the above options, which is not feasible.
We could just manually `--include` and `--exclude` the set of benchmarks that are interesting for compilation performance and artifact sizes in the collect script, but it seems a bit cumbersome. Maybe we could generalize the benchmark category field? Something like:
```
primary - real-world benchmarks useful for compilation performance comparison
secondary - artificial benchmarks useful for compilation performance comparison
artifact-size - benchmarks useful for artifact size comparison
(in the future) runtime - benchmarks useful for runtime performance comparison
```
CC @Mark-Simulacrum
> ```
> primary - real-world benchmarks useful for compilation performance comparison
> secondary - artificial benchmarks useful for compilation performance comparison
> artifact-size - benchmarks useful for artifact size comparison
> (in the future) runtime - benchmarks useful for runtime performance comparison
> ```
Categories are currently mutually exclusive. Would that still be true with this suggestion?
I agree that these measurements could be useful, and that it's hard to decide what to collect and how to collect them.
The `IncrPatched` scenarios already rely on per-benchmark `.patch` files. Perhaps add some new optional scenarios: `FullPanicAbort`, `FullStripTrue`, `FullOptLevelZ`, `FullLtoTrue`, `FullCodegenUnits1`, `FullNoStd`, `FullAll`? And then use the `profile-config.json` file to opt in to each one? You could generalize this further and allow any benchmark to opt into any set of scenarios, though that might not be useful in practice. I think that would be better than changing categories.
I also think scenarios are a good fit here, pretty much along the same lines that @nnethercote suggests. I'm not sure I buy the argument that enums make things harder here -- I think it's good we need to think about each variant if we're matching on the enum; if we can handle it opaquely, then I would expect us to be able to still do that (e.g., by just passing it along, or rendering it to a String for the UI, etc.).
Scenarios currently have some interactions with each other, e.g. `Full` has to be completed before the `Incr` ones can be started, and `IncrPatched` performs a patching operation during the benchmark. So they don't serve only to change the compilation flags; they also affect how the benchmark is performed. It seems to me that the mentioned compile options are slightly orthogonal to that: I could want to benchmark how `PanicAbort` behaves with a `Full` or an `IncrFull` run. In terms of benchmark interaction, none of the mentioned configurations should change how the benchmark is performed; we just want to say that we want to compile it N times with different compilation options.
We need to do several things with the compilation variants:
1) Store them in the DB.
2) Present them in the UI.
3) Opt in to them for certain benchmarks.
We can use explicit enum variants, like `FullOptZ`, `FullPanicAbort`, etc. Benchmarks could opt in to these either explicitly in `collect.sh`, or via `profile-config.json` (`{"extra-scenarios": ["FullPanicAbort", "FullStripTrue"]}`).
Advantages: easy to store in the DB and present in the UI; explicitness in code.
Disadvantages: it won't be possible to easily choose between `FullPanicAbort` and `IncrFullPanicAbort`.
Questions: I'm not sure about the CLI. Would we allow users to select these variants with `--scenarios`? And what about `--scenarios all`: should it select these (somewhat secondary) variants?
The more I think about this, the more it seems to me that compile-time flags like LTO, panic=abort, CGU=1, etc., which do not affect the benchmarking process itself (unlike incremental compilation and patching), should be orthogonal to the other options. What about creating a new DB column and an enum like `Flags::OptZ`, `Flags::PanicAbort`?
This would make the flags explicit in code, be easy to store in the DB and show in the UI, be backwards compatible (we can just leave the column empty if there are no non-default flags to pass), be easy to opt in to from both the CLI and JSON, and it would be possible to combine them with debug/opt and incr/full.
Of course debug/opt is also a compile-time flag in a way, but it's special enough (it isn't normally passed through `RUSTFLAGS` or `Cargo.toml` options) that I wouldn't mind keeping it in the `Profile` category, as it is now.
As for scenarios, presumably just `Full` is enough?
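For illustration, a rough sketch of what such an orthogonal axis could look like; the type name `CodegenFlags`, its variants, and the chosen flag spellings are all hypothetical, not actual rustc-perf types:

```rust
// Hypothetical extra axis, orthogonal to Profile and Scenario.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CodegenFlags {
    Default,
    PanicAbort,
    Strip,
    OptLevelZ,
    Lto,
    CodegenUnits1,
}

impl CodegenFlags {
    /// RUSTFLAGS fragment to append for this configuration (illustrative).
    fn rustflags(self) -> &'static str {
        match self {
            CodegenFlags::Default => "",
            CodegenFlags::PanicAbort => "-C panic=abort",
            CodegenFlags::Strip => "-C strip=symbols",
            CodegenFlags::OptLevelZ => "-C opt-level=z",
            CodegenFlags::Lto => "-C lto=fat",
            CodegenFlags::CodegenUnits1 => "-C codegen-units=1",
        }
    }
}
```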
Incremental compilation artifact sizes are probably also quite interesting.
> The more I'm thinking about this, the more it seems to me that compile-time flags like LTO, panic=abort, CGU=1 etc. which do not affect the benchmark process itself (unlike incr. and patching)
I'm having trouble following why LTO is any different from incremental in terms of affecting compilation - both of these are going to increase compile times (or not), have a different memory profile, produce a different binary output...
I agree in some sense that if we end up with a complete cross matrix of the current scenarios and the new options, storing them separately could make sense. But it also seems to me that for ease of implementation, we can do so purely at the code level (perhaps only in the collector) and serialize into the database as just a scenario. I'm not convinced we have a good semantic for the extra flags column that pushes it to be distinct from scenarios.
> I'm having trouble following why LTO is any different from incremental in terms of affecting compilation - both of these are going to increase compile times (or not), have a different memory profile, produce a different binary output...
The difference is that incremental + patching require changes to the code that runs the benchmark, while LTO/PanicAbort/etc. don't; we just use different flags. But it's not that important.
> I agree in some sense that if we end up with a complete cross matrix of the current scenarios and the new options, storing them separately could make sense. But it also seems to me that for ease of implementation, we can do so purely at the code level (perhaps only in the collector) and serialize into the database as just a scenario. I'm not convinced we have a good semantic for the extra flags column that pushes it to be distinct from scenarios.
I agree! Storing compilation flags in the DB/UI as "just another scenario string", but treating them as a different axis that can be set at the collector level, sounds like a good compromise.
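Continuing the hypothetical sketch above, that compromise could look roughly like keeping the flags as their own field in the collector and flattening them into the stored scenario label only at the database boundary:

```rust
// Hypothetical: flatten the separate flags axis into the stored scenario label.
fn stored_scenario_label(scenario: &str, flags: Option<&str>) -> String {
    match flags {
        Some(f) if !f.is_empty() => format!("{scenario}-{f}"),
        _ => scenario.to_string(),
    }
}
```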
This has now been implemented in https://github.com/rust-lang/rustc-perf/pull/1348, as a first step.
We now record both the size of generated artifacts and the binary size of the compiler (and its associated libraries) itself. And we can also distinguish between library/binary artifacts, to get more context about the sizes. So I think that this can be closed (more specialized issues can be created for concrete use-cases).
That would perhaps be an interesting metric!