Open nnethercote opened 5 years ago
librustc
still has a lot of code that's "just implementation code" and not something infrastructural like types or traits used in the rest of the compiler.
If we look e.g. at rustc/middle
: resolve_lifetime.rs
- just a pass, can be moved to rustc_resolve
, stability.rs
- just a pass, can be moved to rustc_passes
or something.
On my windows box, crate rustc
takes about 2000 seconds to build... Grateful to anything that makes the compilation process faster.
One thing that would be nice to move back from librustc to librustc_mir is all the memory access logic of the Miri interpreter, which these days lives in librustc::mir::interpret::allocation
. But @oli-obk moved it there because it is used by things that do not depend on librustc_mir, so I am not sure what it would take to make that happen.
Can those things not depend on librustc_mir, e.g., being dependencies of librustc_mir? Can @oli-obk comment on that perhaps?
Most of the datastructures need to live there, because the incremental serializer lives there (though we could probably move both out) and because some of the interners intern types like Allocation
. I'm not sure how much of the code is needed there, but we can't add inherent impls in downstream crates, so if we want inherent methods, we need to implement them there.
I've tried a few times to split things out of librustc, it's just so entangled with everything that I failed.
but we can't add inherent impls in downstream crates, so if we want inherent methods, we need to implement them there.
If it's just that we could use extension traits.
The way we should structure the compiler for compilation speed would be to have n
independent crates defining types basically working as header files. Then we'd have a librustc_queries
crates which just declares queries which depends on the n
header crates. After that we'd fan out to m
independent crates containing query definitions and other code.
Currently n
is 1 (libsyntax) and way too much stuff is in librustc
which prevents the m
crates from starting sooner. I think it would make sense to move types from librustc
into crates which librustc
can depend on and code from librustc
into crates which depends on librustc
.
I see that @Centril is kind of doing this already, but I think it would be a good idea to gain some compiler team sign-off here.
The thing preventing moves of types tend to be inherent utility methods which require TyCtxt
. Maybe we could move these to extension traits instead? (with maybe a prelude for these?) One goal could be to remove all uses of TyCtxt
in librustc
except for the query system and then extract the query system into librustc_queries
.
cc @rust-lang/compiler
Another way to deal with TyCtxt
would be to have a QueryCtxt
trait in each crate and use impl QueryCtxt
instead of TyCtxt
for the utility methods in these crates. Maybe we could even remove the dependency on librustc_queries
for some crates that way?
I ran -Ztimings
too on Windows: Results.
It seems like rustc_driver
is taking very long here, probably linker related. We can however see that libsyntax
is much shorter after @Centril's changes.
Please let me know if you need help in this endeavour.
There's been a lot of work done to split up rustc
since this issue was filed. Could someone with access to a many-core machine (16+) post the results of -Ztimings
on a recent master
?
16 core machine
built with CARGOFLAGS_BOOTSTRAP=-Ztimings ./x.py build --stage 1
So rustc
is still responsible for a sizeable chunk of CPU under-utilization. And rustc_driver
is roughly as bad.
On my 14-core Linux box rustc_driver
takes 93s, while the under-utilized final part of rustc
compilation only takes about 50s.
Why is rustc_driver
so slow to compile? It's only 1,800 lines of code. And why is all its compile time in the compiler front-end, with no time for codegen?
perf top
indicates that most of the time during rustc_driver
compilation is in libbd-2.33-system.so
, i.e. it's linking. Ugh, lld
would be much faster... #39915 raises its head again.
@nnethercote You can try setting [rust] use-lld = true
in config.toml
.
Locally it also seems like building librustc_driver
causes enormous amounts of disk I/O, not just CPU load.
Large amounts of disk I/O makes sense for linking. In #39915 I mentioned that switching to lld reduced by rustc_driver
compile time from ~93s to ~41s.
Do we know where most of the rustc
crate build time comes from?
I had a quick look and it seems that the main challenge is that the MIR interpreter and type context depend on each other -- is that correct?
@andjo403 Could you try the current HEAD with the same machine to see if there were any improvements?
16 core 32 threads machine
built with CARGOFLAGS_BOOTSTRAP=-Ztimings ./x.py build --stage 1
build of https://github.com/rust-lang/rust/commit/150322f86d441752874a8bed603d71119f190b8b with default config.toml
the wsl times can differ more due to I do not remember if I checked that the config.toml was "clear" last time. I noticed that I had the parallel-compiler set to true some time ago but do not know if that was in the last measurement or not but as this is a build with the beta compiler maybe not that big difference. also something to remember is that the beta compiler have been updated since the last time.
@andjo403 That's looking promising, thank you.
N.B. the crate has been renamed to rustc_middle
in https://github.com/rust-lang/rust/pull/70536.
The efforts to split up rustc_middle
did reduce its compiletime from ~20 to ~15 minutes on my system IIRC, but right now it seems to be (back?) at 22 minutes.
What are some ways that we can tackle this a bit at a time? @jyn514 mentioned that taking up #78275 would help, but I don't feel that I have enough knowledge of the query system to do that.
@camelid you can ping me on zulip if I can help.
@cjgillot addressed most of this in https://github.com/rust-lang/rust/pull/70951. Do we still need to keep this open? rustc_middle is now only medium-large, and rustc_query_impl is very big but comparable to rustc_mir and a few others.
Here is the output of CARGOFLAGS=--timings x build std
on latest master (88c58e3c2c097ebffac425d9e080dcb1aadf790e):
Full detail (you need to rename it from .txt to .html to view it because github is annoying and blocks html files): cargo-timing-rustc.txt
In particular, rustc_query_impl is still the largest crate by far and is the limiting factor for how soon rustc_driver can start building.
And the delay between rustc_query_impl
completion and the second slowest crate rustc_driver
is waiting on can be bigger with more parallelism, here, 30 seconds:
In this example, rustc_middle
is also still somewhat stalling the pipeline, like this issue suggests, and reducing e.g. its metadata would make its 10 dependent crates start building earlier.
(@jyn514 you can use a gist and its raw link with the htmlpreview
webapp, like this)
Here is the output of -Zdump-mono-items
on query_impl (forked to output JSON instead of markdown): https://gist.github.com/jyn514/e20a1a4c542770d855240169c4a85331
There are some very surprising things here:
$ jq 'sort_by(.total_estimate) | .[-10:] | .[] | { name, total_estimate, instantiation_count }' rustc_query_impl.mono_items.json -c
{"name":"<plumbing::QueryCtxt<'_> as rustc_query_system::query::QueryContext>::start_query::{closure#0}","total_estimate":75240,"instantiation_count":684}
{"name":"std::ptr::mut_ptr::<impl *mut T>::is_null","total_estimate":77119,"instantiation_count":3353}
{"name":"std::ptr::const_ptr::<impl *const T>::is_aligned_to","total_estimate":79212,"instantiation_count":1722}
{"name":"plumbing::force_from_dep_node","total_estimate":84360,"instantiation_count":285}
{"name":"rustc_middle::ty::tls::with_context_opt","total_estimate":84747,"instantiation_count":1599}
{"name":"rustc_query_system::query::plumbing::execute_job","total_estimate":85728,"instantiation_count":228}
{"name":"rustc_query_system::dep_graph::DepGraph::<K>::with_task_impl","total_estimate":89148,"instantiation_count":228}
{"name":"rustc_query_system::query::plumbing::try_load_from_disk_and_cache_in_memory","total_estimate":102372,"instantiation_count":228}
{"name":"hashbrown::raw::RawTable::<T, A>::reserve_rehash","total_estimate":111354,"instantiation_count":277}
{"name":"std::thread::LocalKey::<T>::try_with","total_estimate":186412,"instantiation_count":3214}
LocalKey
is the highest impact to compile times; we should investigate using some other kind of thread-local storage (maybe #[thread_local]
?) instead.start_query
is instantiated too many times; it should only be once per query, but it's being instantiated 3 times for eachSince https://github.com/rust-lang/rust/pull/108638 rustc_query_impl takes significantly less time to compile and is no longer the limiting factor, at least for my setup:
I'm going to close this issue as completed - there's always more perf work to do, but I think query_impl is now fast enough we don't have to single it out :)
Thank you to everyone who worked on this!
Looks like we are back to rustc_middle being the bottleneck, as was the case when this issue originally got created? ;)
Yes, but in a different way than before - now the bottleneck is generating metadata for rustc_middle, not performing codegen.
@jyn514 Could you try a build with rust.codegen-units = 1? rustc_query_impl
is still holding things up slightly for me.
Hmm, do you have codegen-units=1 set for local benchmarking or some other reason?
Here's the graph with 1 CGU - I do see query_impl causing a slight delay now.
It's for benchmarking and LLVM IR inspection purposes. The total build time is similar for 16 and 1 CGUs for me on my Ryzen 7 1700. What are you building on? There seems to be a bit of a divergence in build behavior.
I don't have time in the near future to investigate the difference, happy to reopen in the meantime.
@jyn514 Did you build with rust.lto
set to thin-local
or off
? I built with thin-local
, and off
is the new compiler profile default, so that could explain the divergence.
ahhhh - yes, that's likely it.
EDIT(Centril): The crate was renamed to
rustc_middle
in https://github.com/rust-lang/rust/pull/70536. EDIT(jyn514): Most of rustc_middle was split intorustc_query_impl
in https://github.com/rust-lang/rust/pull/70951.The results from @ehuss's new
cargo -Ztimings
feature really drive home how the size of therustc
crate hurts compile times for the compiler. Here's a picture: Things to note.rustc
crate takes about twice as long to compile as therustc_mir
crate, which is the second-biggest crate.rustc
crate's codegen is happening. (Thank goodness for pipelining, it really makes a difference in this case.)rustc
crate without anything else happening in parallel (from 42-67s, and from 141-174s).Also, even doing
check
builds onrustc
code is painful. On my fast 28-core Linux box it takes 10-15 seconds even for the first compile error to come up. This is much longer than other crates. Refactorings within therustc
crate that require many edit/compile cycles are painful.I've been told that
rustc
is quite tangled and splitting it up could be difficult. That may well be true. The good news is that the picture shows we don't need to split it into 10 equal-sized pieces to get benefits. Even splitting off small chunks into separate crates will have an outsized effect on compilation times.