rcgoodfellow opened 1 year ago
Some additional tests, also running on illumos.
Ryzen 5950X (16c/32t, base clock 3.4 GHz, boost up to 4.9 GHz), 128 GB RAM running @ 3200, dedicated 1 TB Samsung NVMe (OS is running on a separate NVMe)
cargo clean && ptime -m cargo build
real 4:46.005452752
user 43:57.502065154
sys 8:42.829707839
trap 2.247956401
tflt 2.673304572
dflt 3.661350049
kflt 0.017162225
lock 1:13:34.345323723
slp 1:32:29.511645923
lat 1:27.684343420
stop 8:28.648802889
$ du -Ash target
31.9G target
$ cargo clean && ptime -m cargo nextest run
[cargo nextest]
Summary [ 168.925s] 794 tests run: 794 passed (5 slow, 1 leaky), 2 skipped
[ptime]
real 12:11.677374521
user 3:05:58.623930006
sys 30:48.644936368
trap 34.106167448
tflt 9.639465852
dflt 14.283905823
kflt 0.288599200
lock 309:53:28.524718771
slp 16:57:48.486904890
lat 44:10.005121372
stop 59:03.009517416
$ du -Ash target
78.1G target
Looks like the test build compiles a lot more stuff compared to the normal debug build. The target dir increased by over 2x.
The actual test runtime was ~2m 49s, so the increase in the ptime stats seems to be mostly due to building the additional test binaries.
Running with -p omicron-nexus does not make a big difference.
ry@masaka:~/src/omicron$ cargo nextest run -p omicron-nexus
warning: ignoring unknown configuration keys in config file /home/ry/src/omicron/.config/nextest.toml: nextest-version
Compiling omicron-nexus v0.1.0 (/home/ry/src/omicron/nexus)
Compiling mg-admin-client v0.1.0 (/home/ry/src/omicron/mg-admin-client)
Finished test [unoptimized + debuginfo] target(s) in 10m 20s
Starting 317 tests across 4 binaries
I was watching htop for this one. The linker took up ~8m 50s of that 10m 20s.
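With the link step that dominant, a faster linker is the obvious lever (mold comes up again below). A minimal sketch, assuming mold is installed on a Linux machine; mold -run wraps the command and substitutes itself for the default linker:
# run the whole test build under mold instead of the default ld
$ mold -run cargo nextest run -p omicron-nexus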
NOTE TO READER: THE FOLLOWING ARE ALL ON A 32 GiB, 16 CORE LINUX SYSTEM
I tried setting up an incremental rebuild of Nexus, using the following command:
$ touch nexus/src/lib.rs && cargo build --timings -p omicron-nexus --tests
I observed build times of roughly 50 seconds, with the following --timings info:
There are a couple of takeaways from this:
This seems really odd to me, in particular for schema-updater, because it should be a really thin binary, but it's still taking 10+ seconds to build.
Running the following command, I tried to inspect the schema-updater binary size:
size -A -d schema-updater | awk '{print $2,$1}' | tail -n +3 | numfmt --field=1 --to=iec-i --suffix=B | sort -h
0B .tm_clone_table
8B .fini_array
13B .fini
16B .init_array
24B .rela.plt
27B .init
28B .gnu.hash
28B .interp
32B .got.plt
32B .note.ABI-tag
34B .debug_gdb_scripts
36B .note.gnu.build-id
48B .plt
48B .plt.got
120B .tdata
165B .comment
328B .tbss
424B .data
704B .dynamic
768B .gnu.version_r
1.4KiB .bss
2.0KiB .gnu.version
4.6KiB .gcc_except_table
17KiB .dynstr
22KiB .rodata.cst
24KiB .dynsym
68KiB .debug_macro
193KiB .got
230KiB .eh_frame_hdr
340KiB .data.rel.ro
448KiB .rodata
736KiB .eh_frame
997KiB .rela.dyn
2.4MiB .debug_abbrev
4.0MiB .debug_frame
5.7MiB .text
8.1MiB .debug_aranges
12MiB .debug_loc
19MiB .debug_ranges
37MiB .debug_line
213MiB .debug_info
397MiB .debug_pubnames
663MiB .debug_str
743MiB .debug_pubtypes
2.1GiB Total
(The Nexus binary itself is very similar in size, coming in at 2.2 GiB)
The default [profile.dev] option for debug is 2. By setting it to 1, the binary size is significantly reduced (this is the schema-updater again):
0B .tm_clone_table
8B .fini_array
13B .fini
16B .init_array
24B .rela.plt
27B .init
28B .gnu.hash
28B .interp
32B .got.plt
32B .note.ABI-tag
34B .debug_gdb_scripts
36B .note.gnu.build-id
48B .plt
48B .plt.got
120B .tdata
165B .comment
328B .tbss
424B .data
704B .dynamic
768B .gnu.version_r
1.4KiB .bss
2.0KiB .gnu.version
4.6KiB .gcc_except_table
17KiB .dynstr
22KiB .rodata.cst
24KiB .dynsym
68KiB .debug_macro
89KiB .debug_pubtypes
191KiB .got
229KiB .eh_frame_hdr
340KiB .data.rel.ro
447KiB .rodata
710KiB .debug_abbrev
721KiB .eh_frame
992KiB .rela.dyn
3.9MiB .debug_frame
4.8MiB .text
7.8MiB .debug_aranges
11MiB .debug_ranges
31MiB .debug_info
35MiB .debug_line
379MiB .debug_pubnames
436MiB .debug_str
910MiB Total
Additionally, the total build time seems to be reduced a fair bit:
Setting debug = 0 in [profile.dev] of Cargo.toml brings this down even further:
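As a side note, the same experiment can be run without editing Cargo.toml; a sketch, assuming a Cargo recent enough to support profile settings via environment variables (CARGO_PROFILE_DEV_DEBUG maps onto the debug key of [profile.dev]):
# equivalent to debug = 0 in [profile.dev], for this invocation only
$ CARGO_PROFILE_DEV_DEBUG=0 cargo build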
I'm seeing a pretty good bump from this. Using a fresh checkout of omicron:
# cargo clean + cargo build
Finished dev [unoptimized + debuginfo] target(s) in 5m 41s
real 5:41.684878634
user 52:01.649785641
sys 10:20.348812801
trap 2.118963336
tflt 10.500090796
dflt 11.440283705
kflt 0.603422488
lock 1:39:59.263917987
slp 2:11:25.139189481
lat 1:37.587088675
stop 10:27.215142698
# one line change in nexus/src/app/sagas/switch_port_settings_apply.rs, then cargo build
Finished dev [unoptimized + debuginfo] target(s) in 1m 16s
real 1:16.668390021
user 1:34.092838558
sys 1:02.971599531
trap 0.013044407
tflt 0.001391870
dflt 0.420686927
kflt 0.000000000
lock 5:16.007108476
slp 10:58.984496465
lat 0.236748885
stop 0.280932951
Then setting debug = 0:
# cargo clean + cargo build
Finished dev [unoptimized] target(s) in 4m 03s
real 4:03.711154640
user 38:45.185029976
sys 7:39.982399942
trap 1.445820996
tflt 3.634785810
dflt 2.003498021
kflt 0.000645907
lock 1:00:08.045694370
slp 1:28:24.252445845
lat 1:33.636755682
stop 10:18.723593083
# one line change in nexus/src/app/sagas/switch_port_settings_apply.rs, then cargo build
Finished dev [unoptimized] target(s) in 38.04s
real 38.186279952
user 41.574763210
sys 11.621525240
trap 0.006395064
tflt 0.001001251
dflt 0.022501930
kflt 0.000015539
lock 2:22.804716633
slp 3:52.751385681
lat 0.025387724
stop 0.255555593
Edit: Ryzen 3950X (16c/32t, base 3.5 GHz, boost 4.7 GHz), 128 GB DDR4, 500 GB SATA SSD
Seeing a big improvement from debug = 0 on macOS too.
This did not result in a large difference for me.
ry@masaka:~/src/omicron$ ptime -m cargo nextest run
warning: ignoring unknown configuration keys in config file /home/ry/src/omicron/.config/nextest.toml: nextest-version
Compiling omicron-nexus v0.1.0 (/home/ry/src/omicron/nexus)
Compiling omicron-dev-tools v0.1.0 (/home/ry/src/omicron/dev-tools)
Finished test [unoptimized] target(s) in 9m 55s
I tried the same workflow on Linux. The following is for a one-line saga change recompile.
ry@rytk2:~/src/omicron$ time cargo nextest run
Compiling omicron-nexus v0.1.0 (/home/ry/src/omicron/nexus)
Compiling omicron-dev-tools v0.1.0 (/home/ry/src/omicron/dev-tools)
Finished test [unoptimized + debuginfo] target(s) in 2m 01s
Starting 805 tests across 95 binaries (2 skipped)
I do not have mold or lld installed on this system, so this should just be using the standard Linux linker.
ry@rytk2:~$ ls -l /usr/bin/ld
lrwxrwxrwx 1 root root 19 Jun 3 23:49 /usr/bin/ld -> x86_64-linux-gnu-ld
This machine is significantly less capable than my regular dev machine. It has a 6-core Xeon Mobile E-2176M processor and 64G of RAM.
Using mold on Linux is about twice as fast. But I also noticed something weird. When I make a one-line change in a file with my editor (vim) and then save, cargo will recompile things on the next two invocations of cargo nextest. Even weirder, this redundant compilation does not happen when using touch instead of adding a one-line comment to a source file.
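One way to chase down what cargo thinks changed is its fingerprint logging; a sketch (this is an internal, unstable log target, so treat the output as a lead rather than an API):
# ask cargo to explain its rebuild decisions on the next build
$ CARGO_LOG=cargo::core::compiler::fingerprint=info cargo build -p omicron-nexus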
#
# make changes in file
#
ry@rytk2:~/src/omicron$ time cargo nextest run --no-run
Compiling omicron-nexus v0.1.0 (/home/ry/src/omicron/nexus)
Compiling omicron-dev-tools v0.1.0 (/home/ry/src/omicron/dev-tools)
Finished test [unoptimized + debuginfo] target(s) in 1m 07s
real 1m7.938s
user 3m33.098s
sys 0m38.624s
ry@rytk2:~/src/omicron$ time cargo nextest run --no-run
Compiling omicron-nexus v0.1.0 (/home/ry/src/omicron/nexus)
Compiling omicron-dev-tools v0.1.0 (/home/ry/src/omicron/dev-tools)
Finished test [unoptimized + debuginfo] target(s) in 44.23s
real 0m44.624s
user 1m8.197s
sys 0m18.583s
ry@rytk2:~/src/omicron$ time cargo nextest run --no-run
Finished test [unoptimized + debuginfo] target(s) in 0.85s
real 0m1.223s
user 0m1.470s
sys 0m1.232s
One more data point:
[profile.dev]
debug = "line-tables-only" # This required an update to 1.71.0!
This still preserves backtraces, but otherwise removes all debug info from the binaries.
Considering that the default option for release builds already removes all this debug info, I think this is a worthwhile inclusion. I'll put up a PR soon with this change, which should be one small step towards alleviating this pain during incremental development.
Hey folks! Sorry I was out Friday. I caught up on the recording.
I indeed was putting some time into this directly, back in like, March. What was frustrating is that everything I tried showed absolutely no improvements, which is why you didn't see any patches from me. I was putting together a slide deck, expecting to show it off at demo day, so I apologize for screenshots rather than saving the text of this stuff. Here's what I tried and what I found at the time:
First of all, it was suggested to me that I mostly try and build nexus, so I used it as a baseline. That is, for "dirty" builds, I would change lines in omicron-nexus and rebuild. First, I wanted to get some baseline numbers of where compilation time was going.
A dirty rebuild was spending the vast, vast majority of its time in codegen, both debug:
and release:
So I decided to focus my initial work on bringing codegen time down. I spent a lot of time figuring out how to use various exotic compilation options, benchmarking them, and comparing them.
There was a general suspicion at the time that generics and monomorphization were a significant portion of the compile-time budget, due to diesel and other things. I tried out the "share generics" optimization:
This ended up with basically the same numbers as before; I should re-look into this one, however, because in my notes I just have "build with shared generics" and not whether it was debug or release; these are the same times as the debug build, which would make sense given
https://github.com/rust-lang/rust/blob/master/compiler/rustc_session/src/config.rs#L1077-L1080
It is possible trying this again for release builds would be of interest; however, it does require a nightly compiler to turn this on, as do many of the exotic options I was trying.
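For anyone who wants to reproduce this, a sketch of turning it on (it is a -Z flag, hence nightly):
# opt in to sharing monomorphized generics across crates
$ RUSTFLAGS="-Zshare-generics=y" cargo +nightly build -p omicron-nexus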
Next up, I wanted to validate that monomorphization was in fact a problem. llvm-lines to the rescue:
This tool measures the number and size of instantiations of each generic function in a program, indicating which parts of your code offer the highest leverage in improving compilation metrics.
By default it sorts by size, that is, which functions have the largest bodies.
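The invocation is simple; a sketch, assuming cargo-llvm-lines from crates.io and that -p passes through as it does for cargo build:
# count lines of monomorphized LLVM IR per generic function
$ cargo install cargo-llvm-lines
$ cargo llvm-lines -p omicron-nexus | head -n 20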
So let's see:
We have:
So that all makes sense, and confirms what we suspected. However... there's a problem with this analysis. Let's look at the bottom of the list:
A lot of very similar functions end up having different names: closures get distinct names, so the symbols differ slightly. This doesn't give a super accurate picture, because some functions are basically duplicated in a way that llvm-lines won't see.
We can fix that though: RUSTFLAGS="-Csymbol-mangling-version=v0" ends up fixing this:
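Concretely, that rerun is something like the following sketch; RUSTFLAGS reaches the rustc invocations that cargo-llvm-lines drives under the hood:
# v0 mangling gives closures structured names, collapsing near-duplicates
$ RUSTFLAGS="-Csymbol-mangling-version=v0" cargo llvm-lines -p omicron-nexus | head -n 20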
To which you may say, "holy crap, why is register_endpoint such a big deal here? It is twice as large as the next function." Well, if we look at that function at the time:
We'll come back to this one later.
So what about number of copies? Well:
Our three largest functions by number of copies are async machinery. Not much we can do there, that I'm aware of. In general, I would characterize this as like:
but even the stuff with the most copies is under 2%. Of course, if we group them together, like all the 'async machinery' functions, they'd be larger, but like... there are no easy wins here.
What about with our special name mangling?
Why is validating an error our top result here? Well, earlier we suspected api.register of being generic. Let's take a look at the llvm-ir for register_endpoints, its caller. We end up with a massive sequence of near-identical blocks that look like this:
api.register is a generic function, so we were inlining a ton of monomorphized code into this huge function body, causing it to be the largest function by far. The fix to that was https://github.com/oxidecomputer/dropshot/pull/597, which manually outlines this API.
This helped, but only a small amount, and we've already done it for the largest function by far. This leads me to believe that trying to deal with the largest functions won't be particularly productive in reducing codegen time, as we've basically already taken all the low-hanging fruit with this one thing.
At this point I felt like I had exhausted the llvm-lines-based approach. Time to try another tool.
measureme is a "support crate for rustc's self-profiling feature": https://github.com/rust-lang/measureme
Let's see what rustc says is taking all of its time:
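The workflow, sketched (assumes a nightly toolchain and measureme's summarize tool; <pid> is a placeholder for the rustc process id embedded in the output filename):
# emit self-profile data for each rustc invocation, then summarize it
$ RUSTFLAGS="-Zself-profile" cargo +nightly build -p omicron-nexus
$ summarize summarize omicron_nexus-<pid>.mm_profdata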
We have similar results to our other tooling: optimization is taking 20% of the time, LTO is taking 17% + 14%, then various codegen things, and by the time we get to type checking we're down to 0.5% of the total time. This continues to tell me that codegen is just simply slow. Which is not helpful, but I guess it's nice that the tools aren't contradicting each other?
Next I tried out cargo-bloat: https://github.com/RazrFalcon/cargo-bloat Instead of llvm-lines, which analyzes llvm-ir, cargo bloat looks at what survives all those passes. Maybe there are some clues there?
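A sketch of the invocation, assuming cargo-bloat from crates.io:
# largest functions in the final binary, after all the LLVM passes
$ cargo install cargo-bloat
$ cargo bloat -p omicron-nexus -n 20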
It's kind of interesting that quote shows up, but there are a lot of proc macros, so this makes sense. But still, like, all of these things end up being so tiny that outliers don't really exist.
Cargo bloat does have one more interesting option: --time. This tests how long various things take to compile. I have in my notes "don't try this at home"; I think this one took a very, very long time to complete:
omicron_nexus continues to show up as an overall outlier here.
At this point I had exhausted all of my research on how to tackle this problem, and was sick of making my machine (and atrium for that matter) cry by building everything over and over and over again. So I left it there.
I still am not 100% sure what will bring build times down significantly. It is possible Rust is simply this slow to compile when building large projects that rely on generics as heavily as we do. I hope that isn't totally true, but at least from this line of investigation, I'm not sure what more we can do there, specifically.
I think it would be really helpful to close the gap between illumos and Linux. Incremental test compilation for saga changes is around a minute on Linux and around ten minutes on illumos. I'm fine with a longer wait upfront if iteration can move quickly after an initial compile.
I'm curious about your time-based cargo-bloat result, @steveklabnik. In particular, lalrpop and polar-core seem to both be part of the oso authnz stuff:
$ cargo tree -q -i lalrpop
lalrpop v0.19.12
[build-dependencies]
└── polar-core v0.26.4
└── oso v0.26.4
├── nexus-db-queries v0.1.0 (/ws/omicron/nexus/db-queries)
│ ├── nexus-test-utils v0.1.0 (/ws/omicron/nexus/test-utils)
│ │ └── omicron-dev-tools v0.1.0 (/ws/omicron/dev-tools)
│ │ [dev-dependencies]
│ │ ├── nexus-db-queries v0.1.0 (/ws/omicron/nexus/db-queries) (*)
│ │ └── omicron-nexus v0.1.0 (/ws/omicron/nexus)
│ │ └── omicron-dev-tools v0.1.0 (/ws/omicron/dev-tools)
│ └── omicron-nexus v0.1.0 (/ws/omicron/nexus) (*)
└── omicron-nexus v0.1.0 (/ws/omicron/nexus) (*)
Is it possible that oso (and, critically, code generated by oso policies) is actually what's vastly expensive to compile?
That's consistent with my experience with lalrpop. I initially tried using lalrpop in the front end of the P4 compiler, and the compile times were unreasonably long.
@jclulow I did find it a bit odd, as these packages never appeared in my other analyses. Then again I was focused on incremental compile-times of omicron-nexus, which means I don't think it would have showed up in those graphs.
My fear with a lot of the stuff which is basically code generation (serde, lalrpop, diesel, etc) is that it can be extremely difficult to detect and correctly categorise the impact in at least two dimensions:
I also wonder, w.r.t. our linker challenges, how many sections and symbols are produced in expensive code generation only to be discarded during the expensive link step, getting us both coming and going, as they say.
I will now note that cargo bloat --time will now print:
Note: prefer using cargo --timings.
So... yeah. As always these tools are mostly heuristics, useful for tracking down leads, but are also sometimes misleading.
See also: #1122.
More anecdata (from Oct 11, so this is a bit old):
- cargo run --bin=omdb to get it built
- cargo nextest run --no-run -p omicron-omdb to get that built
- If I touch omdb (touch dev-tools/omdb/src/bin/omdb/main.rs), then run either of these commands, it takes about 6s. i.e., it takes the same amount of time to rebuild the omdb binary or its test. (And to be clear, I touch'd the file again in between these commands so that I should have been testing the same thing each time. It's not 6 additional seconds to build the tests.)
- After touching a file in nexus/db-queries (a dep of omicron-omdb): cargo run --bin=omdb took about 12.7s now -- makes sense that it might be a bit longer. But the cargo nextest run invocation took 4m50s.

Here's the actual sequence:
dap@ivanova omicron-inventory $ ptime -m cargo nextest run --no-run -p omicron-omdb
info: experimental features enabled: setup-scripts
Finished test [unoptimized + debuginfo] target(s) in 0.70s
real 1.838577525
user 1.142917910
sys 0.635211780
trap 0.000181429
tflt 0.011755183
dflt 0.050526203
kflt 0.000002083
lock 0.487250811
slp 3.273384219
lat 0.009670932
stop 0.161927711
dap@ivanova omicron-inventory $ ptime -m cargo run --bin=omdb
Finished dev [unoptimized + debuginfo] target(s) in 0.67s
Running `target/debug/omdb`
Omicron debugger (unstable)
Usage: omdb [OPTIONS] <COMMAND>
Commands:
db Query the control plane database (CockroachDB)
mgs Debug a specific Management Gateway Service instance
nexus Debug a specific Nexus instance
oximeter Query oximeter collector state
sled-agent Debug a specific Sled
help Print this message or the help of the given subcommand(s)
Options:
--log-level <LOG_LEVEL> log level filter [env: LOG_LEVEL=] [default: warn]
--dns-server <DNS_SERVER> [env: OMDB_DNS_SERVER=]
-h, --help Print help (see more with '--help')
real 0.823996539
user 0.514413420
sys 0.312149573
trap 0.000073869
tflt 0.000677192
dflt 0.000009447
kflt 0.000000000
lock 0.160386006
slp 0.036371324
lat 0.000332270
stop 0.000217022
dap@ivanova omicron-inventory $ touch dev-tools/omdb/src/bin/omdb/main.rs
dap@ivanova omicron-inventory $ ptime -m cargo run --bin=omdb
Compiling omicron-omdb v0.1.0 (/home/dap/omicron-inventory/dev-tools/omdb)
Finished dev [unoptimized + debuginfo] target(s) in 7.64s
Running `target/debug/omdb`
Omicron debugger (unstable)
Usage: omdb [OPTIONS] <COMMAND>
Commands:
db Query the control plane database (CockroachDB)
mgs Debug a specific Management Gateway Service instance
nexus Debug a specific Nexus instance
oximeter Query oximeter collector state
sled-agent Debug a specific Sled
help Print this message or the help of the given subcommand(s)
Options:
--log-level <LOG_LEVEL> log level filter [env: LOG_LEVEL=] [default: warn]
--dns-server <DNS_SERVER> [env: OMDB_DNS_SERVER=]
-h, --help Print help (see more with '--help')
real 7.766483179
user 4.362707462
sys 2.829051006
trap 0.000626510
tflt 0.040078158
dflt 0.332766975
kflt 0.007159075
lock 19.905019879
slp 24.678334104
lat 0.009740832
stop 0.070412119
dap@ivanova omicron-inventory $ touch dev-tools/omdb/src/bin/omdb/main.rs
dap@ivanova omicron-inventory $ ptime -m cargo nextest run --no-run -p omicron-omdb
info: experimental features enabled: setup-scripts
Compiling omicron-omdb v0.1.0 (/home/dap/omicron-inventory/dev-tools/omdb)
Finished test [unoptimized + debuginfo] target(s) in 6.75s
real 7.759561985
user 5.967499388
sys 3.493188458
trap 0.000847790
tflt 0.000849156
dflt 0.001432364
kflt 0.000000000
lock 24.488297810
slp 42.928272275
lat 0.008038384
stop 0.287555207
dap@ivanova omicron-inventory $ touch nexus/db-queries/src/db/datastore/inventory.rs
dap@ivanova omicron-inventory $ ptime -m cargo run --bin=omdb
Compiling nexus-db-queries v0.1.0 (/home/dap/omicron-inventory/nexus/db-queries)
Compiling omicron-omdb v0.1.0 (/home/dap/omicron-inventory/dev-tools/omdb)
Finished dev [unoptimized + debuginfo] target(s) in 12.10s
Running `target/debug/omdb`
Omicron debugger (unstable)
Usage: omdb [OPTIONS] <COMMAND>
Commands:
db Query the control plane database (CockroachDB)
mgs Debug a specific Management Gateway Service instance
nexus Debug a specific Nexus instance
oximeter Query oximeter collector state
sled-agent Debug a specific Sled
help Print this message or the help of the given subcommand(s)
Options:
--log-level <LOG_LEVEL> log level filter [env: LOG_LEVEL=] [default: warn]
--dns-server <DNS_SERVER> [env: OMDB_DNS_SERVER=]
-h, --help Print help (see more with '--help')
real 12.226250993
user 7.904808550
sys 3.903073005
trap 0.001204963
tflt 0.034137448
dflt 0.045761531
kflt 0.000010159
lock 32.117216826
slp 33.204286610
lat 0.015440673
stop 0.098503080
dap@ivanova omicron-inventory $ touch nexus/db-queries/src/db/datastore/inventory.rs
dap@ivanova omicron-inventory $ ptime -m cargo nextest run --no-run -p omicron-omdb
info: experimental features enabled: setup-scripts
Compiling nexus-db-queries v0.1.0 (/home/dap/omicron-inventory/nexus/db-queries)
Compiling omicron-nexus v0.1.0 (/home/dap/omicron-inventory/nexus)
Compiling nexus-test-utils v0.1.0 (/home/dap/omicron-inventory/nexus/test-utils)
Compiling omicron-omdb v0.1.0 (/home/dap/omicron-inventory/dev-tools/omdb)
Finished test [unoptimized + debuginfo] target(s) in 4m 49s
real 4:50.851127646
user 4:45.996576215
sys 19.619832512
trap 0.045308518
tflt 0.010773210
dflt 0.004733686
kflt 0.000007914
lock 24:09.533682953
slp 32:53.794376066
lat 0.073881124
stop 0.513001738
@sunshowers asked for cargo nextest run --no-run -p omicron-omdb --timings, so I ran that. I'll attach the output. This is a branch of mine with some extra stuff, but I wouldn't expect it to significantly affect any of this.
tl;dr: A single-line change in a saga incurs a 10-minute build penalty for running tests. I'm developing on a 64-core machine with 256 GB of RAM using NVMe disks with trim enabled.
I run tests with cargo nextest run. I do not typically use -p, as it's not always clear what packages my changes may intersect with. In these cases, I feel like the build system should be the one figuring out what needs to be rebuilt, not the user. It's also not clear why a change in a saga, which is seemingly near the top of the build pyramid, would cause a rebuild of unrelated testing packages.
Here is what a test run looks like with no compilation, i.e. just running the tests.
Here is a one-line change to saga code.
Within that time, the build accounts for the majority, as reported by cargo.