> I've had no issues compiling in the past (on nightly)

Is there perhaps at least a vague idea of which nightly it was that used to work well?
I can confirm that `cargo check` succeeds on the following toolchains (apologies, I should have checked on my un-updated laptops prior to submitting this ticket!):

- master branch:
- remove-unstable-features branch:
Does not reproduce on Linux.
Updated my toolchains on OS X; I was able to reproduce on stable. On nightly it succeeded, but it took an incredibly long time (12 minutes, 41 seconds on a 2.2 GHz Intel i7).
Unable to reproduce. I am also on Linux. (triple: `x86_64-unknown-linux-gnu`)
```
git clone https://github.com/bgourlie/rs-nes.git
cd rs-nes
git checkout 4d98499f # HEAD of remove-unstable-features at time of posting
cargo +stable check # using rustc 1.32.0 (9fda7c223 2019-01-16)
```
Also unable to reproduce on Linux (WSL).
Okay. I can confirm that it takes a suspiciously long time to `cargo check` on nightly even on Linux (either on rs-nes `master` or `remove-unstable-features`). It does eventually finish successfully.
I attached `gdb` to a running process and sampled some backtraces.
When `cargo check`ing with `RUSTFLAGS="-Ztime-passes"`, I can see that the `expand crate` step takes an unusually large amount of time:
```
Checking rs_nes v0.0.1 (/tmp/rs-nes/rs-nes)
time: 0.009; rss: 54MB parsing
time: 0.000; rss: 54MB attributes injection
time: 0.000; rss: 54MB garbage collect incremental cache directory
time: 0.000; rss: 54MB recursion limit
time: 0.000; rss: 54MB crate injection
time: 0.000; rss: 54MB plugin loading
time: 0.000; rss: 54MB plugin registration
time: 0.000; rss: 54MB background load prev dep-graph
time: 0.002; rss: 55MB pre ast expansion lint checks
time: 176.138; rss: 100MB expand crate
time: 0.000; rss: 100MB check unused macros
time: 176.138; rss: 100MB expansion
time: 0.000; rss: 100MB maybe building test harness
time: 0.001; rss: 100MB maybe creating a macro crate
time: 0.007; rss: 100MB creating allocators
time: 0.004; rss: 100MB AST validation
time: 0.016; rss: 102MB name resolution
time: 0.003; rss: 102MB complete gated feature checking
time: 0.000; rss: 102MB blocked while dep-graph loading finishes
time: 0.020; rss: 116MB lowering ast -> hir
time: 0.016; rss: 116MB early lint checks
time: 0.008; rss: 125MB validate hir map
time: 0.051; rss: 125MB indexing hir
time: 0.000; rss: 125MB load query result cache
time: 0.000; rss: 125MB looking for entry point
time: 0.000; rss: 125MB dep graph tcx init
time: 0.000; rss: 125MB looking for plugin registrar
time: 0.000; rss: 125MB looking for derive registrar
time: 0.001; rss: 127MB loop checking
time: 0.002; rss: 127MB attribute checking
time: 0.004; rss: 127MB stability checking
time: 0.010; rss: 131MB type collecting
time: 0.000; rss: 131MB outlives testing
time: 0.000; rss: 131MB impl wf inference
time: 0.045; rss: 142MB coherence checking
time: 0.000; rss: 142MB variance testing
```
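For anyone wanting to reproduce that measurement: `-Ztime-passes` is a nightly-only `rustc` debugging flag, and passing it through `RUSTFLAGS` (as quoted above) is the usual way to reach it from Cargo. Assuming a rustup-managed toolchain, the invocation is along these lines:

```
RUSTFLAGS="-Ztime-passes" cargo +nightly check
```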
triage. P-high. Assigning to self for further bisection on OS X (to hopefully confirm or reject the reporter's hypothesis regarding #57494).
On my Mac OS X laptop, the source of the slowdown appears to be tied to the expansion-time construction (or maybe compilation) of the following line:
```rust
const CYCLES_MAP: [u8; #total_cycles] = [#(#compact_cycle_number_map),*];
```
Or at least, when I replace that array with one that just says:

```rust
const CYCLES_MAP: [u8; #total_cycles] = [0; #total_cycles];
```

then the overall compilation via `cargo check` completes quite quickly.
`eprintln!` instrumentation reveals that during this expansion, the const array has 89,342 entries (i.e. `total_cycles == 89342`).
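For context, here is a minimal sketch of the kind of proc-macro code that emits such an array. This is not rs-nes's actual macro: the function name and cycle data are stand-ins, and I'm assuming a `quote!`-based macro (the `#(...)*` interpolation syntax in the snippets above suggests as much):

```rust
use proc_macro2::TokenStream;
use quote::quote;

// Hypothetical stand-in for the macro's array generation; the real macro
// derives the cycle numbers from opcode tables.
fn emit_cycles_map(total_cycles: usize) -> TokenStream {
    let compact_cycle_number_map = (0..total_cycles).map(|i| (i % 8) as u8);
    // Every element of the repetition becomes its own token tree, so with
    // total_cycles == 89_342 this single line pushes on the order of 180k
    // token trees (values plus separating commas) through token-stream
    // concatenation during expansion.
    quote! {
        const CYCLES_MAP: [u8; #total_cycles] = [#(#compact_cycle_number_map),*];
    }
}
```

By contrast, the `[0; #total_cycles]` variant above expands to a handful of token trees regardless of `total_cycles`, which fits the observation that it compiles quickly.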
I am now focusing on bisection; I found that I was able to comment out most of the other code, effectively producing a much-reduced test case, which I will try to extract into its own crate (or perhaps a pair of source files? we'll see) in a bit.
Here is a gist that captures the test case I've reduced this to for now (unfortunately gist does not allow forward slashes in the names, so I've used spaces instead of slashes as the directory separator): https://gist.github.com/pnkfelix/06a63f45fda5066245b9b516eea68598
Regarding that reduced test case, here's what I've observed so far:
- … `rustup`.
- In the reduced test case, I tried adjusting the size of the array in the generated code by changing the values of `SCANLINES` and `CYCLES_PER_SCANLINE`. (You can see that in the version I posted above, there is `eprintln!` instrumentation reporting that `total_cycles` is 17,000.)
The main reason why I've included the above table (in the details block) is that there looks to be a non-linear relationship between `total_cycles` and the elapsed time: each row increases `total_cycles` by 1,000, but the per-row time delta starts at around 1.5 seconds and grows to 3 seconds by the end. (I also included the table because on some of my runs, the step from 16,000 to 17,000 jumped by quite a lot. But after repeated runs it appears that may have been noise.)
Between the two nightlies I referenced in the above comment, here are the PRs that were merged:

```
% git log 75a369c5b..2fadb0a16 --author=bors --format=oneline
```

- … (Make `TokenStream` less recursive.)
- … (… `is_finite` (2x) and `is_infinite` (1.6x).)
- … (import `_thread` for Python 3)
- … (… `String` to `&'static str` in `ParseResult::Failure`.)
- … (… `ptr::eq` where applicable)
- … (… `Self` ctor as a local variable)

Of the PRs referenced above, my spidey-sense is most immediately firing in response to seeing #57004 ("Make `TokenStream` less recursive.").
Okay, further bisection indicates that the regression was injected by 2cf736f.
Spidey-sense continues to fire in response to #57004. I'm going to look into reverting that on its own and see what happens.
According to an Instruments run profiling the invocation of `rustc` on my demo test code, the vast majority of the time is being spent here:
https://github.com/rust-lang/rust/blob/106b3e9fa4d53efc4e8eab47d1043789c88f99a5/src/libsyntax/tokenstream.rs#L262
I'm going to review #57004 and its follow-on code a little, and see if I can determine whether we can get rid of that eager traversal and cloning of the contents of the vector of streams.
cc @nnethercote
Okay I think I have a fix.
I made two different micro-optimizations to the code path I identified in my previous comment:

1. Pre-traverse `streams` to figure out what capacity our accumulating `vec` is going to require, and then build the vec with that capacity.
2. Check whether the `Lrc` for `stream2` has a single owner (i.e. a ref-count of one); if so, then we can just move the trees directly from `stream2` into `vec`, without cloning them.

After rebuilding the compiler with those micro-optimizations, my demo code compiles much faster than before.
Here's the current diff. I'm going to try to identify which of the two micro-optimizations was the important one:
```diff
diff --git a/src/libsyntax/tokenstream.rs b/src/libsyntax/tokenstream.rs
index f5d2d6f18e..db6b701b83 100644
--- a/src/libsyntax/tokenstream.rs
+++ b/src/libsyntax/tokenstream.rs
@@ -255,11 +255,19 @@ impl TokenStream {
             0 => TokenStream::empty(),
             1 => streams.pop().unwrap(),
             _ => {
-                let mut vec = vec![];
+                let tree_count = streams.iter()
+                    .map(|ts| match &ts.0 { None => 0, Some(s) => s.len() })
+                    .sum();
+                let mut vec = Vec::with_capacity(tree_count);
                 for stream in streams {
                     match stream.0 {
                         None => {},
-                        Some(stream2) => vec.extend(stream2.iter().cloned()),
+                        Some(stream2) => {
+                            match Lrc::try_unwrap(stream2) {
+                                Ok(trees) => vec.extend(trees.into_iter()),
+                                Err(stream2) => vec.extend(stream2.iter().cloned()),
+                            }
+                        }
                     }
                 }
                 TokenStream::new(vec)
```
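(For what it's worth, `Lrc` is the compiler's alias for `Rc`/`Arc`, and `try_unwrap` returns the inner value by move only when there is exactly one strong owner; otherwise it hands the pointer back, which is why the cloning fallback has to stay.)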
One mystery to me is why the problem was not replicated on Linux. If I am correct that one or both of the micro-optimizations listed above fixes this issue, then we should be observing a similar execution time hit on Linux systems, I imagine...
... although ... maybe it is the `with_capacity` call that is fixing things on OS X, and memory allocation on Linux is simply so much better than on OS X that it is able to mask the problem there?
Update: I overlooked @ExpHP's report that the slowdown does reproduce on Linux.
Okay, I have now determined that on my OS X machine, it is the `with_capacity` delta that makes the difference here. The attempt to avoid cloning the trees within a singly-owned vector doesn't help at all for my reduced demo.
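To illustrate why the capacity hint alone can matter, here is a standalone sketch (not compiler code; names and data are made up) of the same concatenation pattern with and without pre-computing the final length:

```rust
// Concatenate many chunks into one Vec, the pattern from_streams performs.
fn concat_without_hint(chunks: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for c in chunks {
        // Each extend can trigger a reallocation that copies everything
        // accumulated so far; the cost of those reallocations is
        // allocator- and platform-dependent.
        out.extend(c.iter().cloned());
    }
    out
}

fn concat_with_hint(chunks: &[Vec<u8>]) -> Vec<u8> {
    // Pre-traverse to compute the final length, mirroring the tree_count
    // computation in the diff above, then allocate exactly once.
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total);
    for c in chunks {
        out.extend(c.iter().cloned());
    }
    out
}
```

`Vec`'s geometric growth keeps the total copying amortized-linear either way, so if `with_capacity` is the whole fix, the difference presumably lies in how expensive the individual reallocations are under the OS X system allocator, consistent with the speculation above.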
Well, there were two problems here, right? One that does not replicate on Linux (the crash), and one that does (the super-slow `cargo check`).
Oh, I'm sorry, I overlooked @ExpHP's comment that they reproduced the slowdown on Linux.
Just to make sure I cover all the issues here, I have now replicated the stack overflow on OS X atop commit 9fda7c2237.
A backtrace shows that the overflow is happening during a particularly deep series of drop calls. (Chances seem high that this is related in some manner to how the token-tree/token-stream was (re)structured.)
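For intuition, this is the classic way a drop-driven overflow arises; a generic sketch with a made-up type, not the compiler's actual token-stream structures:

```rust
// A deeply nested ownership chain whose automatic drop glue recurses one
// stack frame per level.
enum Nested {
    Leaf,
    Node(Box<Nested>),
}

fn main() {
    // Build the chain iteratively, which needs only constant stack.
    let mut n = Nested::Leaf;
    for _ in 0..1_000_000 {
        n = Nested::Node(Box::new(n));
    }
    // Dropping `n` recurses ~1,000,000 frames deep and is likely to
    // overflow the stack unless Drop is implemented iteratively
    // (e.g. by unlinking nodes in a loop).
    drop(n);
}
```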
While attempting to compile my application (or simply run `cargo check`), rustc will crash, with slightly different results depending on whether it's the stable or nightly branch.

OS: Windows 10 x64 version 1809 build 17763.253
stable toolchain

`cargo check` result:

```
thread 'main' has overflowed its stack
```
Repro steps:

```
cd rs-nes
git checkout remove-unstable-features
cargo check
```
nightly toolchain

`cargo check` result:

```
Process finished with exit code -1
```

Repro steps:
Suspected cause

I've narrowed down the issue to a procedural macro that I've had no issues compiling in the past (on nightly). It generates a large match statement, which made me think that this recent change (#57494) may be related, but I have no evidence to support that other than the fact that it changes how the compiler behaves with respect to large match statements.
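To give a sense of scale, the generated code has roughly this shape; a hypothetical sketch with made-up opcodes and bodies, not rs-nes's actual output:

```rust
// One arm per opcode; the real generated match is far larger and the
// bodies perform actual emulation steps rather than returning a count.
fn cycles_for(opcode: u8) -> u32 {
    match opcode {
        0x00 => 7, // e.g. BRK
        0x01 => 6,
        0x05 => 3,
        // ... the generated code continues with an arm for every opcode ...
        _ => 2,
    }
}
```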
It's also worth noting that I created the `remove-unstable-features` branch specifically to see if this was only an issue on nightly. It's possible that the error seen on stable is a manifestation of not using unstable features I had been relying on (box patterns, in this case).
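For reference, `box_patterns` is the nightly feature in question, and a generic example (not rs-nes's code) of the stable-compatible rewrite looks like this:

```rust
enum Tree {
    Leaf(u8),
    Node(Box<Tree>),
}

fn depth(t: &Tree) -> usize {
    match t {
        Tree::Leaf(_) => 1,
        // On nightly with #![feature(box_patterns)] this arm could be
        // written `Tree::Node(box inner) => 1 + depth(inner)`, putting the
        // deref inside the pattern; on stable we bind the Box and rely on
        // deref coercion instead.
        Tree::Node(inner) => 1 + depth(inner),
    }
}
```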