Closed sfackler closed 6 years ago
Things are significantly worse on 1.13 - 2 minutes in translation!
This may be the effect of the projection cache.
cc @alexcrichton This is obviously a show-stopper for futures.
I can do some profiling. I've had some plans for improving collection/trans that I think may be related. One question to try and answer is what %age of this is just "we are making more code" vs "we are wasting time doing things in trait selection that could be cached". I have observed the latter from time to time and had some thoughts on how to fix it.
I've also seen this before, with tokio-socks5 as well. Removing just a handful of the trait objects in that file makes the compile time of the crate shoot from 2.34s to 89.52s (!!)
Here's a simple example of something that takes a very long time to compile:
future::ok::<(),()>(()).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(())).and_then(|()| Ok(()));
Closed #40280 as a duplicate, moved some example code into the issue header.
Just a stupid question ‒ the „translation item collection“, which is one of the two culprits here, is listed in the „MIR optimisations“. Does it really have to happen at all on non-optimised debug build?
It's natural to read it that way, but the headers are actual after their group, so it's part of translation
This seems to be hard enough to take some time fixing. So I thought of a workaround, if someone is also interested. It uses the trick with placing trait objects to split the chains of modifiers, but only on testing/debug builds where the compilation speed matters, while it keeps the complex but hopefully faster concrete types in release build:
This one is for streams (that's what I needed), but can obviously work for futures or other things as well:
#[cfg(debug_assertions)]
fn test_boxed<T, E, S>(s: S) -> Box<Stream<Item = T, Error = E>>
where S: Stream<Item = T, Error = E> + 'static
{
Box::new(s)
}
#[cfg(not(debug_assertions))]
fn test_boxed<S>(s: S) -> S {
s
}
(I didn't find any better config option than the debug_assertions
, but I hope that one is good enough)
Nominating in place of https://github.com/rust-lang/rust/issues/42941. cc @eddyb (nominator)
Repeating another bad case from #42941:
extern crate futures;
use futures::{future, IntoFuture, Future};
fn main() {
let t: std::result::Result<(), ()> = Ok(());
let f = t
.into_future()
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()))
.and_then(|_| future::ok(()));
f.wait();
}
The code above takes ~750s to compile on my laptop (you can make it shorter/longer by removing/adding .and_then
calls):
...
time: 0.000; rss: 108MB write metadata
time: 472.840; rss: 110MB translation item collection
time: 0.001; rss: 112MB codegen unit partitioning
time: 0.001; rss: 133MB internalize symbols
time: 732.741; rss: 133MB translation
...
The extra ~250s is spent between codegen unit partitioning
and internalize symbols
.
Another observation from #42941 about why this is becoming even more important is that the newly-released hyper 0.11
uses futures everywhere, which will likely cause many more programs to use chains of futures like this going forward.
triage: P-high
We are raising to "high priority" to at least do some investigation and try to determine whether revised trait solving strategies will be of use here.
status: waiting for niko
Bump. This really hurts impl Trait
(boxing hides this issue).
I'm not sure if this is fixed... I'm on nightly, version 1.23.0-nightly
on Arch Linux...
I get the following in -Ztime-passes
:
time: 0.001; rss: 57MB parsing
time: 0.000; rss: 57MB recursion limit
time: 0.000; rss: 57MB crate injection
time: 0.000; rss: 57MB plugin loading
time: 0.000; rss: 57MB plugin registration
time: 0.054; rss: 122MB expansion
time: 0.000; rss: 122MB maybe building test harness
time: 0.000; rss: 122MB maybe creating a macro crate
time: 0.000; rss: 122MB creating allocators
time: 0.000; rss: 122MB checking for inline asm in case the target doesn't support it
time: 0.000; rss: 122MB AST validation
time: 0.006; rss: 128MB name resolution
time: 0.000; rss: 128MB complete gated feature checking
time: 0.000; rss: 128MB lowering ast -> hir
time: 0.000; rss: 128MB early lint checks
time: 0.001; rss: 128MB indexing hir
time: 0.000; rss: 128MB attribute checking
time: 0.000; rss: 128MB lifetime resolution
time: 0.000; rss: 128MB looking for entry point
time: 0.000; rss: 128MB looking for plugin registrar
time: 0.000; rss: 128MB loop checking
time: 0.000; rss: 128MB static item recursion checking
time: 0.000; rss: 132MB stability checking
time: 0.001; rss: 135MB type collecting
time: 0.000; rss: 135MB outlives testing
time: 0.000; rss: 135MB impl wf inference
time: 0.000; rss: 135MB coherence checking
time: 0.000; rss: 135MB variance testing
time: 0.001; rss: 139MB wf checking
time: 0.000; rss: 139MB item-types checking
time: 0.080; rss: 173MB item-bodies checking
time: 0.009; rss: 177MB const checking
time: 0.001; rss: 177MB privacy checking
time: 0.000; rss: 177MB intrinsic checking
time: 0.000; rss: 177MB match checking
time: 0.000; rss: 177MB liveness checking
time: 0.679; rss: 177MB borrow checking
time: 0.000; rss: 177MB MIR borrow checking
time: 0.000; rss: 177MB MIR effect checking
time: 0.000; rss: 177MB death checking
time: 0.000; rss: 177MB unused lib feature checking
time: 0.009; rss: 183MB lint checking
time: 0.000; rss: 183MB resolving dependency formats
time: 0.000; rss: 183MB write metadata
so this looks very similar. In fact, I too have code with a bunch of:
future_maker()
.and_then( futurey_stuff())
.and_then( futurey_stuff())
.and_then( futurey_stuff())
.and_then( futurey_stuff())
.and_then( futurey_stuff())
.and_then( futurey_stuff())
Sorry if it's a bother... just wasn't sure where to ask and this seems to be the most recent reference.
Does it matter that some of the and_then
s have move closures instead?
Of course not. Particularly for the futures case generators will help, but @nikomatsakis doesn't seem to be working on this.
Just to double-check, that'd be std::ops::Generator
?
Why would generators help? It's a trait, it doesn't do away with the deeply nested types.
Generators help because they are an alternative to combinators. When you write a generator, you get a single (anonymous) type for the entire state machine; when you use combinators, you build the state machine out of nested types.
triage: P-medium
I'm going to lower this from P-high to reflect reality. I'm still eager to investigate but haven't had time, and making it P-high is not helping =)
So I finally did some work here. In particular, I extracted out a self-contained example that seems to reproduce the problem, which let me do a bit of experimentation. The gist contains two variants of the example; one that models Future
quite closely (though without IntoFuture
), and the other which removes the Output
type. The timing results are quite interesting. See if you can spot the outlier. ;)
size
variant 4 8 12 16
full .19s .34s 2.5s 36.9s
no-output .19s .26s .9s 9.8s
I haven't had time to dig deeper than that yet.
These can probably be considered even more self-contained than what @nikomatsakis have made in this post, and IMO they show the issue more consistently (reproducible exponential blowups by a factor of approximately 2 for each line of .chain(empty())
in every compilation step I specifically highlighted below) so I'll post them:
$ rustc +nightly --crate-type=lib foo_generic.rs -Z time-passes
...
time: 34.801; rss: 92MB item-bodies checking
...
All other timings are <0.3s.
$ rustc +nightly --crate-type=lib foo_specialized.rs -Z time-passes
...
time: 43.160; rss: 89MB item-bodies checking
...
time: 34.286; rss: 99MB translation item collection
...
time: 129.115; rss: 122MB translate to LLVM IR
...
time: 163.680; rss: 122MB translation
...
time: 28.048; rss: 122MB LLVM passes
...
All other timings are <0.2s.
Rustc version:
$ rustc +nightly --version
rustc 1.23.0-nightly (827cb0d61 2017-11-26)
I'm digging on this. Some easy profiling:
After some digging it seems that https://github.com/rust-lang/rust/blob/ebddfcb0b19471d2277e05cfbc2ad8caa81637b7/src/librustc/ty/subst.rs#L324 may be a cause.
I only have little knowledge regarding the compiler, but if I'm correct this relates to @nikomatsakis experiment on the number of output types. This looks like a place we can do a exponential blowup, by the factor of parameters.
I've tried profiling rustc and getting a backtrace for the generic version by @hcpl above at some random moment of time.
The cause in this case seems to be that at every Chain
the rustc::traits::project::project_type
function calls rustc::traits::select::SelectionContext::select
twice: the first time from assemble_candidates_from_impls
via InferCtxt::probe
, the second time from rustc::traits::project::confirm_candidate
. Both calls recurse on the next level, causing the exponential blowup.
Thanks for the profiles everyone, btw, I've been spending a bit more time investigating too but don't yet have any insights to share. =)
@nikomatsakis can you share what problem have you stumbled over? Thanks to @tanriol's profile I think the root cause is quite clear; we just need to cache the call to eliminate the blowup. What I'm worrying is that SelectionContext::select
does have side effects, and as I don't have much knowledge on projection I don't know if the second (redundant) execution can be easily removed.
Yeah, so this is related to deeply nested types itself, and it doesn't matter where it appear. As argument, return type, or even just creating on stack triggers this issue. #39684 #43757 should be duplicates.
Finding as of today:
The blowup roots from confirm_impl_candidate
in select module (not project). Confirming candidate is a non-cachable operation. It then attempts to normalize the type (which doesn't seem to cache too) inside vtable_impl
and recurs to next level.
cc @tanriol @nikomatsakis Please check if this clears up some obstacles. Primarily I'm wondering why normalization is continuously performed (maybe due to snapshot?).
Okay, I think I've successfully understood the overall picture of this blowup.
First, it's true that we're calling SelectionContext::select
twice per iteration. This is what we probably need to reduce to fix the issue.
Second, the blowup occurs in operations that are not currently cached. In confirm_impl_candidate
, we create an snapshot and then calls vtable_impl
, then impl_or_trait_obligations
. It's the place we recur onto next level, and because we are in a snapshot, the unification results are completely discarded after the operation. This is the second cause making the operation non-cachable: unification relies on side effects so it doesn't execute twice, but we revert the changes so it's computing every time.
Adding to that, impl_or_trait_obligations
tries to normalize the inner type (?) for each predicate, which sounds like a multiplicative behavior here. However, this is ran without clearing the snapshot, which may make the situation better (it still sounds like an issue to me, though). Also, the reproduction example only used Iterator obligation which is the only one, thus the multiplicative behavior here hasn't been tested yet.
To sum up, we didn't get any short circuits here due to the two factors preventing caching from happening. Next, I'm proposing some solutions below, focusing on fixing the "calling twice" problem:
Try to preserve the snapshot for the impl candidate we resolve. We can reuse it later which eliminates the need of calling confirm_candidate the second time. However, as far I understood I'm not sure if snapshot is designed to be saved beyond scope (it always take a closure for pub APIs).
Do not call confirm_candidates
the first time, instead call it only when we're really confirming it. My concern is that this may breaks code that relies on side effects (unification) which I don't understand yet.
This is rather focused on solving the normalization cache, and the idea is quite abstract. We can try to cache normalizations, but I have no idea what we should take as a key.
I need your expertise on this, so we can reach a solution that resolves the issue. Please give me your opinion!
Another update: the projection cache actually exists, but we're discarding them every time by wiping the InferCtxt (with snapshots). I have created a diagram to show how this recurs:
So basically we need to implement this FIXME: https://github.com/rust-lang/rust/blob/554fe71ac6172ce9c0baeabbfd692280dcc893fe/src/librustc/traits/project.rs#L1548-L1550
@arielb1 You seem to have created this FIXME. Can you give me some hints on implementing cross-infcx cache?
I have launched a PoC in #48296 and I confirm that the exponential part should be resolved. I still observe some degree of polynomial time algorithms, which makes rustc chokes at about 50 chain()
.
@hcpl I suspect you copied the wrong code for the specialized version above, as I observe the same time-passes behavior and the code lacks main()
despite you specify --crate-type bin
for the specialized version.
Can you check again, and if possible, provide the correct specialized version code?
Update: although I have fixed the normalization part, the typeck still seems to be exponential on time, and very memory consuming. I'm investigating the cause of this second issue.
I have implemented a fix for the typeck part too, and I believe there's no more exponential algorithms in rustc anymore.
@ishitatsuyuki the code is correct, but both rustc
invocations should have been with --crate-type=lib src/lib.rs
(which were assumed to be in different crates). Sorry about that.
I've changed filenames to foo_generic.rs
and foo_specialized.rs
respectively to better reflect their purpose.
Compiling the postgres-tokio crate at https://github.com/sfackler/rust-postgres/commit/d27518ba76d76ccaa59b3ccd63e981bd8bd0ef33 goes from 5 seconds to 45 seconds on nightly if the two
.boxed()
calls in the middle of this call chain are removed: https://github.com/sfackler/rust-postgres/blob/d27518ba76d76ccaa59b3ccd63e981bd8bd0ef33/postgres-tokio/src/lib.rs#L342-L408.Looks like 15 seconds is spent in translation item collection, and 39 seconds is spent in translation:
Things are significantly worse on 1.13 - 2 minutes in translation!
Some discussion in IRC: https://botbot.me/mozilla/rust-internals/2016-12-22/?msg=78294648&page=1
cc @aturon
UPDATE: #40280 was closed as a duplicate of this. It had the following sample code:
--nmatsakis