rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

The `rustc_query_impl` crate is too big, which hurts compile times for the compiler itself #65031

Open nnethercote opened 5 years ago

nnethercote commented 5 years ago

EDIT(Centril): The crate was renamed to rustc_middle in https://github.com/rust-lang/rust/pull/70536. EDIT(jyn514): Most of rustc_middle was split into rustc_query_impl in https://github.com/rust-lang/rust/pull/70951.

The results from @ehuss's new cargo -Ztimings feature really drive home how the size of the rustc crate hurts compile times for the compiler. Here's a picture: [cargo -Ztimings graph of a rustc build].

Also, even doing check builds on rustc code is painful. On my fast 28-core Linux box it takes 10-15 seconds even for the first compile error to appear, which is much longer than for other crates. Refactorings within the rustc crate that require many edit/compile cycles are painful.

I've been told that rustc is quite tangled and splitting it up could be difficult. That may well be true. The good news is that the picture shows we don't need to split it into 10 equal-sized pieces to get benefits. Even splitting off small chunks into separate crates will have an outsized effect on compilation times.

petrochenkov commented 5 years ago

librustc still has a lot of code that's "just implementation code" and not something infrastructural like types or traits used in the rest of the compiler.

If we look e.g. at rustc/middle:

- resolve_lifetime.rs: just a pass, can be moved to rustc_resolve
- stability.rs: just a pass, can be moved to rustc_passes or something

crlf0710 commented 5 years ago

On my windows box, crate rustc takes about 2000 seconds to build... Grateful to anything that makes the compilation process faster.

RalfJung commented 5 years ago

One thing that would be nice to move back from librustc to librustc_mir is all the memory access logic of the Miri interpreter, which these days lives in librustc::mir::interpret::allocation. But @oli-obk moved it there because it is used by things that do not depend on librustc_mir, so I am not sure what it would take to make that happen.

Mark-Simulacrum commented 5 years ago

Could those things be made to depend on librustc_mir instead, i.e. become dependents of librustc_mir? Can @oli-obk comment on that, perhaps?

oli-obk commented 5 years ago

Most of the data structures need to live there, because the incremental serializer lives there (though we could probably move both out) and because some of the interners intern types like Allocation. I'm not sure how much of the code is needed there, but we can't add inherent impls in downstream crates, so if we want inherent methods, we need to implement them there.

I've tried a few times to split things out of librustc, it's just so entangled with everything that I failed.

RalfJung commented 5 years ago

> but we can't add inherent impls in downstream crates, so if we want inherent methods, we need to implement them there.

If it's just that, we could use extension traits.
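
For illustration, a minimal sketch of that extension-trait pattern; the names here (AllocationExt, the bytes field) are made up for the example and are not the real compiler types:

```rust
// Minimal sketch of the extension-trait workaround. `Allocation` stands in for
// a type defined upstream (e.g. in librustc); `AllocationExt` is a hypothetical
// trait that a downstream crate (e.g. librustc_mir) could define, since inherent
// impls are only allowed in the crate that defines the type.

// --- upstream crate: just the type definition ---
pub struct Allocation {
    pub bytes: Vec<u8>,
}

// --- downstream crate: methods added via an extension trait ---
pub trait AllocationExt {
    fn size(&self) -> usize;
}

impl AllocationExt for Allocation {
    fn size(&self) -> usize {
        self.bytes.len()
    }
}

fn main() {
    let alloc = Allocation { bytes: vec![0; 16] };
    // Callers only need to bring `AllocationExt` into scope (e.g. via a prelude).
    assert_eq!(alloc.size(), 16);
}
```

The main cost is that every caller has to import the trait, which is why a shared prelude is usually suggested alongside this pattern.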

Zoxc commented 4 years ago

The way we should structure the compiler for compilation speed would be to have n independent crates defining types, basically working as header files. Then we'd have a librustc_queries crate which just declares queries and depends on the n header crates. After that we'd fan out to m independent crates containing query definitions and other code.

Currently n is 1 (libsyntax), and way too much stuff is in librustc, which prevents the m crates from starting sooner. I think it would make sense to move types from librustc into crates which librustc can depend on, and code from librustc into crates which depend on librustc.
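
As a rough sketch of that fan-out shape (the names below are invented and the real query system is macro-generated, but rustc does wire providers together through a struct of function pointers along these lines):

```rust
// Illustrative sketch only: a "declaration" crate defines the types and the
// provider table, and independent downstream crates fill in the providers.
// Names (DefId, Providers, provide_*) are simplified stand-ins.

// --- "header"-style crate: types and query declarations only ---
#[derive(Clone, Copy)]
pub struct DefId(pub u32);

#[derive(Default)]
pub struct Providers {
    pub type_of: Option<fn(DefId) -> String>,
    pub mir_built: Option<fn(DefId) -> Vec<String>>,
}

// --- downstream crate A: provides type-checking queries ---
fn provide_typeck(providers: &mut Providers) {
    providers.type_of = Some(|def_id| format!("type of DefId({})", def_id.0));
}

// --- downstream crate B: provides MIR queries ---
fn provide_mir(providers: &mut Providers) {
    providers.mir_built = Some(|def_id| vec![format!("mir for DefId({})", def_id.0)]);
}

fn main() {
    // The driver wires everything together; because providers are plain
    // function pointers, crates A and B only need the declaration crate to
    // have finished building, not each other.
    let mut providers = Providers::default();
    provide_typeck(&mut providers);
    provide_mir(&mut providers);
    println!("{}", (providers.type_of.unwrap())(DefId(7)));
    println!("{:?}", (providers.mir_built.unwrap())(DefId(7)));
}
```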

I see that @Centril is kind of doing this already, but I think it would be a good idea to gain some compiler team sign-off here.

The thing preventing moves of types tends to be inherent utility methods which require TyCtxt. Maybe we could move these to extension traits instead (with maybe a prelude for them)? One goal could be to remove all uses of TyCtxt in librustc except for the query system, and then extract the query system into librustc_queries.

cc @rust-lang/compiler

Zoxc commented 4 years ago

Another way to deal with TyCtxt would be to have a QueryCtxt trait in each crate and use impl QueryCtxt instead of TyCtxt for the utility methods in these crates. Maybe we could even remove the dependency on librustc_queries for some crates that way?
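
A small sketch of what that could look like, with a made-up QueryCtxt trait and a dummy context standing in for TyCtxt (none of these names are the real compiler APIs):

```rust
// Illustrative sketch: utility code written against a local `QueryCtxt` trait
// instead of the concrete TyCtxt, so the defining crate does not need to
// depend on the crate that owns TyCtxt. All names here are hypothetical.

pub trait QueryCtxt {
    fn is_const_fn(&self, def_id: u32) -> bool;
}

// A utility function that previously would have taken `TyCtxt<'tcx>` directly.
pub fn can_const_prop(cx: impl QueryCtxt, def_id: u32) -> bool {
    cx.is_const_fn(def_id)
}

// Elsewhere (e.g. in the crate that owns the real context), the trait is
// implemented for it. A dummy context stands in for TyCtxt here.
struct DummyCtxt;

impl QueryCtxt for DummyCtxt {
    fn is_const_fn(&self, _def_id: u32) -> bool {
        true
    }
}

fn main() {
    assert!(can_const_prop(DummyCtxt, 0));
}
```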

Zoxc commented 4 years ago

I ran -Ztimings too on Windows: Results.

It seems like rustc_driver is taking very long here, probably linker related. We can however see that libsyntax is much shorter after @Centril's changes.

Here is also a check build.

02.02.2020 timings. 27.03.2020 timings.

cjgillot commented 4 years ago

Please let me know if you need help in this endeavour.

ecstatic-morse commented 4 years ago

There's been a lot of work done to split up rustc since this issue was filed. Could someone with access to a many-core machine (16+) post the results of -Ztimings on a recent master?

andjo403 commented 4 years ago

16-core machine, built with CARGOFLAGS_BOOTSTRAP=-Ztimings ./x.py build --stage 1

[timing graphs: Windows, Linux (WSL2)]

nnethercote commented 4 years ago

So rustc is still responsible for a sizeable chunk of CPU under-utilization. And rustc_driver is roughly as bad.

nnethercote commented 4 years ago

On my 14-core Linux box rustc_driver takes 93s, while the under-utilized final part of rustc compilation only takes about 50s.

Why is rustc_driver so slow to compile? It's only 1,800 lines of code. And why is all its compile time in the compiler front-end, with no time for codegen?

nnethercote commented 4 years ago

perf top indicates that most of the time during rustc_driver compilation is in libbfd-2.33-system.so, i.e. it's linking. Ugh, lld would be much faster... #39915 raises its head again.

Zoxc commented 4 years ago

@nnethercote You can try setting use-lld = true under the [rust] section in config.toml.

RalfJung commented 4 years ago

Locally it also seems like building librustc_driver causes enormous amounts of disk I/O, not just CPU load.

nnethercote commented 4 years ago

Large amounts of disk I/O make sense for linking. In #39915 I mentioned that switching to lld reduced my rustc_driver compile time from ~93s to ~41s.

lambda-fairy commented 4 years ago

Do we know where most of the rustc crate build time comes from?

I had a quick look and it seems that the main challenge is that the MIR interpreter and type context depend on each other -- is that correct?

Centril commented 4 years ago

@andjo403 Could you try the current HEAD with the same machine to see if there were any improvements?

andjo403 commented 4 years ago

16-core / 32-thread machine, built with CARGOFLAGS_BOOTSTRAP=-Ztimings ./x.py build --stage 1

Build of https://github.com/rust-lang/rust/commit/150322f86d441752874a8bed603d71119f190b8b with default config.toml.

The WSL times may differ more, because I don't remember whether I checked that config.toml was "clean" last time. I noticed that I had parallel-compiler set to true some time ago, but I don't know whether that was still the case for the last measurement; since this is a build with the beta compiler, it may not make a big difference anyway. Also worth remembering: the beta compiler has been updated since last time.

[timing graphs: Windows, Linux (WSL2)]

Centril commented 4 years ago

@andjo403 That's looking promising, thank you.

N.B. the crate has been renamed to rustc_middle in https://github.com/rust-lang/rust/pull/70536.

matthiaskrgr commented 4 years ago

The efforts to split up rustc_middle did reduce its compile time from ~20 to ~15 minutes on my system IIRC, but right now it seems to be (back?) at 22 minutes.

camelid commented 4 years ago

What are some ways that we can tackle this a bit at a time? @jyn514 mentioned that taking up #78275 would help, but I don't feel that I have enough knowledge of the query system to do that.

cjgillot commented 4 years ago

@camelid you can ping me on zulip if I can help.

jyn514 commented 3 years ago

@cjgillot addressed most of this in https://github.com/rust-lang/rust/pull/70951. Do we still need to keep this open? rustc_middle is now only medium-large, and rustc_query_impl is very big but comparable to rustc_mir and a few others.

jyn514 commented 1 year ago

Here is the output of CARGOFLAGS=--timings x build std on latest master (88c58e3c2c097ebffac425d9e080dcb1aadf790e): [cargo timings graph]

Full detail (you need to rename it from .txt to .html to view it because github is annoying and blocks html files): cargo-timing-rustc.txt

In particular, rustc_query_impl is still the largest crate by far and is the limiting factor for how soon rustc_driver can start building.

lqd commented 1 year ago

And the delay between rustc_query_impl finishing and the second-slowest crate that rustc_driver is waiting on can be even bigger with more parallelism; here it is 30 seconds:

[cargo timings graph]

In this example, rustc_middle is also still somewhat stalling the pipeline, like this issue suggests, and reducing e.g. its metadata would make its 10 dependent crates start building earlier.

(@jyn514 you can use a gist and its raw link with the htmlpreview webapp, like this)

jyn514 commented 1 year ago

Here is the output of -Zdump-mono-items on query_impl (forked to output JSON instead of markdown): https://gist.github.com/jyn514/e20a1a4c542770d855240169c4a85331

There are some very surprising things here:

```console
$ jq 'sort_by(.total_estimate) | .[-10:] | .[] | { name, total_estimate, instantiation_count }' rustc_query_impl.mono_items.json -c
{"name":"<plumbing::QueryCtxt<'_> as rustc_query_system::query::QueryContext>::start_query::{closure#0}","total_estimate":75240,"instantiation_count":684}
{"name":"std::ptr::mut_ptr::<impl *mut T>::is_null","total_estimate":77119,"instantiation_count":3353}
{"name":"std::ptr::const_ptr::<impl *const T>::is_aligned_to","total_estimate":79212,"instantiation_count":1722}
{"name":"plumbing::force_from_dep_node","total_estimate":84360,"instantiation_count":285}
{"name":"rustc_middle::ty::tls::with_context_opt","total_estimate":84747,"instantiation_count":1599}
{"name":"rustc_query_system::query::plumbing::execute_job","total_estimate":85728,"instantiation_count":228}
{"name":"rustc_query_system::dep_graph::DepGraph::<K>::with_task_impl","total_estimate":89148,"instantiation_count":228}
{"name":"rustc_query_system::query::plumbing::try_load_from_disk_and_cache_in_memory","total_estimate":102372,"instantiation_count":228}
{"name":"hashbrown::raw::RawTable::<T, A>::reserve_rehash","total_estimate":111354,"instantiation_count":277}
{"name":"std::thread::LocalKey::<T>::try_with","total_estimate":186412,"instantiation_count":3214}
```

jyn514 commented 1 year ago

Since https://github.com/rust-lang/rust/pull/108638, rustc_query_impl takes significantly less time to compile and is no longer the limiting factor, at least for my setup: [cargo timings graph]

I'm going to close this issue as completed - there's always more perf work to do, but I think query_impl is now fast enough that we don't have to single it out :)

Thank you to everyone who worked on this!

RalfJung commented 1 year ago

Looks like we are back to rustc_middle being the bottleneck, as was the case when this issue originally got created? ;)

jyn514 commented 1 year ago

Yes, but in a different way than before - now the bottleneck is generating metadata for rustc_middle, not performing codegen.

Zoxc commented 1 year ago

@jyn514 Could you try a build with rust.codegen-units = 1? rustc_query_impl is still holding things up slightly for me.

jyn514 commented 1 year ago

Hmm, do you have codegen-units=1 set for local benchmarking or some other reason?

Here's the graph with 1 CGU - I do see query_impl causing a slight delay now: [cargo timings graph]

Zoxc commented 1 year ago

It's for benchmarking and LLVM IR inspection purposes. The total build time is similar for 16 and 1 CGUs for me on my Ryzen 7 1700. What are you building on? There seems to be a bit of a divergence in build behavior.

jyn514 commented 1 year ago

I don't have time in the near future to investigate the difference, happy to reopen in the meantime.

Zoxc commented 1 year ago

@jyn514 Did you build with rust.lto set to thin-local or off? I built with thin-local, and off is the new compiler profile default, so that could explain the divergence.

jyn514 commented 1 year ago

ahhhh - yes, that's likely it.