rust-lang / rust

Bug: a macro never hit recursion limit, `cargo build` run... forever! #51754

Open · limira opened 6 years ago

limira commented 6 years ago

I know what my mistake is, ~~but I think there is a bug in Rust!~~ Edit: we just need better help from the compiler to easily spot mistakes like this.

Edit: the problem seems to involve only the first rule, so the new minimal buggy code is:

macro_rules! there_is_a_bug {
    ( $id:ident: $($tail:tt)* ) => {
        there_is_a_bug! { $($tail:tt)* } // I did it wrong here: the `:tt`
    };
}
fn main() {
    there_is_a_bug! { something: more {} }
}

cargo build will run forever! (well, I only waited for it for about ten minutes!)
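For reference, a sketch of what I meant to write (macro name invented for this sketch): in a transcriber, a metavariable is written without its fragment specifier, and a base rule is needed so the recursion can stop.

```rust
macro_rules! no_bug {
    () => {}; // base case: stop once the tokens run out
    ( $id:ident: $($tail:tt)* ) => {
        no_bug! { $($tail)* } // just `$tail`; the `:tt` specifier belongs only in the matcher
    };
}
fn main() {
    no_bug! { something: more: } // consumes `something:`, then `more:`, then stops
}
```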

rustup show
stable-x86_64-unknown-linux-gnu (default)
rustc 1.27.0 (3eda71b00 2018-06-19)
jonas-schievink commented 6 years ago

there_is_a_bug! { $($tail:tt)* } will triple the size of $tail on each invocation; you might be hitting exponential blowup here, which the recursion limit can't catch.

limira commented 6 years ago

... I think I only partially understand what you said! My impression is that macro expansion goes depth-first, so I would expect it to hit the recursion limit very soon.

I tried with trace_macros!(true); but cargo build just keeps running without any output. In the actual code, which is much bigger than there_is_a_bug!, it is really hard to find what causes the problem, because the compiler is unable to help in this case!

ExpHP commented 6 years ago

So, expansion does go depth first. What's blowing up is the argument list:

there_is_a_bug! { something: more {} }
// expands to
there_is_a_bug! { more :tt {} :tt }
// expands to
there_is_a_bug! { tt :tt {} :tt : :tt tt :tt }
// expands to
there_is_a_bug! { tt :tt {} :tt : :tt tt :tt : :tt : :tt tt :tt tt :tt : :tt tt :tt }
// ...

(each `$tail` of n token trees expands into 3n tokens, of which only the leading ident and the `:` are consumed by the next match, so the argument list roughly triples on every call)

In order to emit a recursion error here without first blowing the heap, I guess it would have to produce the inner token stream lazily. I'm not sure how much this could impact performance?

I tried with trace_macros!(true); but cargo build just keep running without any output.

Arrhhh! :) The old trace_macros circa 1.0 could have helped here (the one which simply dumped the invocations directly to stdout, as opposed to the new one which goes through the diagnostic APIs).

limira commented 6 years ago

Oh, thank you very much! What still remains a question for me is: why does the memory usage increase so slowly? When running cargo build on the minimal code above, I opened the System Monitor and watched rustc. I expected its memory usage to increase drastically, but contrary to my expectation, it increases... slowly. But I am glad that it does not eat up all my memory, because when my system (Fedora 27) runs out of memory, I don't know what to do except cut its power.

ExpHP commented 6 years ago

What still remains a question for me is: why does the memory usage increase so slowly?

This is an interesting question, so I tried taking a look into it using perf. Unfortunately there isn't much I can say with any confidence!

**amateur hour theatre: `perf` edition**

`perf stat` shows:

```
 Performance counter stats for 'rustc src/main.rs':

    30,455,642,992      instructions            #  0.61 insn per cycle            (50.05%)
    50,185,555,208      cycles                                                    (29.97%)
    17,998,100,630      L1-dcache-loads                                           (39.92%)
       610,971,632      L1-dcache-load-misses   #  3.39% of all L1-dcache hits    (49.93%)
        77,208,094      dTLB-load-misses                                          (49.93%)
     1,831,023,187      cache-references                                          (39.98%)
     5,621,259,040      branch-instructions                                       (50.00%)
         9,967,961      branch-misses           #  0.18% of all branches          (50.05%)
     8,744,794,469      L1-icache-loads                                           (40.08%)
        59,622,536      L1-icache-misses        #  0.68% of all L1-icache hits    (40.04%)
```

3.3% cache misses sounds like maybe a lot for what is mostly a bunch of similar code running without allocating memory at a fast rate; but I'm not sure.

---

The three hottest functions identified by a basic `perf record` are:

```
 % samples   object      demangled name
   47.93%    libsyntax   <… as Clone>::clone
   24.47%    libsyntax   <… as Drop>::drop
   16.79%    libsyntax   <…>::cloned
```

What are the `T`s? Well, to be honest, I don't know. Looking at the disassembly output, my best guess is that in all three cases, `T` is `ext::tt::macro_parser::NamedMatch`.

**Aside:** I see calls to a function called `::clone@plt`, which point to this funny looking stub in the `.plt` section. It's hard for me to understand why `::clone` has *any* presence in an optimized binary!

---

What's calling those three functions so often? Unfortunately, `perf record -g` is giving me useless callstacks with no information, even on an (optimized) compiler built with debug info.

That said, there aren't very many clones in `macro_parser.rs`. One thing does stand out to me, which is

```rust
impl<'a> MatcherPos<'a> {
    /// Add `m` as a named match for the `idx`-th metavar.
    fn push_match(&mut self, idx: usize, m: NamedMatch) {
        let matches = Rc::make_mut(&mut self.matches[idx]);
        matches.push(m);
    }
}
```

which potentially clones a `Vec` (in `Rc::make_mut`) just to push a single item. However, I can't find any "smoking gun" evidence that this is the problem.
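To make the `Rc::make_mut` behavior concrete, here is a standalone sketch (toy code, not from rustc): `make_mut` only hands out `&mut` to uniquely owned data, so if any other `Rc` to the same `Vec` is alive, the entire `Vec` is cloned just to push one element.

```rust
use std::rc::Rc;

fn main() {
    let mut matches = Rc::new(vec![1, 2, 3]);
    let other = Rc::clone(&matches); // a second owner, e.g. another matcher position

    // The strong count is 2, so `make_mut` deep-clones the Vec before
    // returning `&mut` to it; the push is cheap, the clone is not.
    Rc::make_mut(&mut matches).push(4);

    assert_eq!(*other, vec![1, 2, 3]);      // the old copy is untouched
    assert_eq!(*matches, vec![1, 2, 3, 4]); // we now own a private copy
}
```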
ExpHP commented 6 years ago

By the way, when I mentioned "the old trace_macros," this is actually a thing you can do:

rustup toolchain add 1.5.0
cargo +1.5.0 build

If your crate begins with

#![feature(trace_macros)]
trace_macros!(true);

then it will do the tracing to stdout, even though #![feature] isn't supposed to work on stable compilers. (What happens is that macro expansion occurs before the compiler checks whether you are actually allowed to use #![feature] flags!)

ExpHP commented 6 years ago

One last thing: here's a flamegraph generated from a stage 1 debug compiler. This is a non-optimized compiler, since the optimized compiler built with debug info was producing callstacks that provided zero context for any of the functions of interest (the second item in the callstack was <unknown> at some raw address).

Flamegraph: perf.svg.gz

It suggests the following:

What I do not know, however, is whether these are performance bugs that can be fixed, or if they are concessions that were made in favor of optimizing hotter codepaths.

limira commented 6 years ago

By the way, when I mentioned "the old trace_macros," this is actually a thing you can do

Thanks for the info! Maybe the current trace_macros!(true); needs to be improved to help in cases like this without having to go back in time.

ExpHP commented 6 years ago

I'm going to create a separate issue for the lack of trace_macros! output, so that this issue can be about the performance of this macro's expansion (which I think ought to hit some sort of error more quickly).

ExpHP commented 6 years ago

(I think this should be tagged with I-slow or something similar, can anyone add that?)

oli-obk commented 6 years ago

This seems like a perfect fit for immutable data structures: only clone if something is actually modified; otherwise, just keep adding stuff to the end.
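For illustration, a sketch of that idea using the third-party `im` crate (crate choice is just for demonstration; rustc does not use it): clones of a persistent vector share structure, so cloning is O(1) and a push copies only the nodes on the path it modifies.

```rust
// Sketch only: add `im = "15"` to [dependencies] to try it.
use im::Vector;

fn main() {
    let a: Vector<i32> = (0..1_000).collect();
    let mut b = a.clone(); // O(1): `b` shares all of its structure with `a`

    b.push_back(1_000);    // copies only the spine nodes it touches

    assert_eq!(a.len(), 1_000); // `a` is unaffected
    assert_eq!(b.len(), 1_001);
}
```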

ExpHP commented 6 years ago

Sounds like a good idea at least from a textbook perspective; though I wonder about cache locality, and whether it really is necessary to have all of these intermediate values hanging around with one less item in the first place. The multiple levels of recursively-called drops in the flame-graph indicate to me that refcounts drop to zero on a regular basis.

(that said, I myself don't really understand what the method is doing or why it is written this way)

ExpHP commented 6 years ago

I still have a perverse fascination with why there's so much wasted sharing here, but still don't really understand what matches contains even with all of the nice documentation there. (I'd like to do some "println reverse engineering" to help ground my interpretations, but don't currently have a machine that isn't busy running some heavy computations!).

cc @mark-i-m who added much of this documentation recently and probably has a much better grasp of what the code is doing. Any thoughts?

mark-i-m commented 6 years ago

@ExpHP I assume you mean this?

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L171

This field contains token trees that match against metavariables. Consider the following example:

// Macro definition
macro_rules! foo {
    ($($id:ident),*) => {}
}

// Invocation
fn main() {
    let w = 0;
    let x = 0;
    let y = 0;
    let z = 0;
    foo!(w, x, y, z);
}

In this example $id is a metavariable. It will match against four different token trees: w, x, y, and z, so the contents of matches would be those token trees.

Does that make sense?
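To see those four matches from the outside, here is a small runnable illustration (macro invented for this comment): each repetition of `$id` becomes one entry in the matches for that metavariable.

```rust
// Each matched ident is one entry in `matches`; stringify! lets us count them.
macro_rules! count_idents {
    ($($id:ident),*) => {
        [$(stringify!($id)),*].len()
    };
}

fn main() {
    // Four matched token trees -> four entries.
    assert_eq!(count_idents!(w, x, y, z), 4);
}
```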


Regarding the inefficiencies of the code, it's been a while since I looked at it, so I'm not sure off the top of my head if/why such things might be necessary. I don't think the original authors made much effort to make it efficient. Recently, @nnethercote went through and reduced the number of allocations in #50855.

ExpHP commented 6 years ago

Thanks for responding!

I think my greater confusion is with respect to when the Rc<Vec<_>>s are ever meaningfully shared. Are we sharing them between multiple branching candidate paths? Are we building a DAG? (this is what seems to be suggested by the Rc<Vec<NamedMatch>> recursively contained within NamedMatch).

The comments suggest[^1] there is sharing between the subparser state in a repetition and the top level state, yet it seems that everything in match_lo..=match_hi is guaranteed to be modified (and thus replicated if the refcount exceeds 1).


[^1]: I say this due to the phrase

the matches field is shared (hence the Rc) among all "nested" matchers

though the next paragraph seems to contradict it:

Also, note that while we are descending into a sequence, matchers are given their own matches vector.

mark-i-m commented 6 years ago

Ah, I see what you mean.

I'm not 100% sure, but I think there are two different reasons why that exists:

  1. `MatcherPos`es are often cloned to produce the next position. We want to avoid cloning lots of `Vec`s and `TokenTree`s. Those clones appear in these two locations:

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L486

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L539

  2. NamedMatch::MatchedSeq can contain a collection of matches for a single metavariable that are shared by at least two possible matcher positions.

In particular, consider the following example. We want to match foo foo foo against $($x:ident)+. After processing the first foo, the matcher position looks like this (where o is the "dot"): $($x:ident o)+. There are two possible next matcher positions:

  1. Start a new repetition: $(o $x:ident)+
  2. Done repeating: $($x:ident)+ o

Both of these new matcher positions are added to cur_items. In the first case, we just reuse item after updating it:

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L511

and

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L519

In the second case, we clone the up matcher (the context right before we entered the sequence), modify it so that the "dot" is after the sequence, and share the new matches we found with the new position. That occurs here:

https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L486-L498

Does that make sense?

ExpHP commented 6 years ago

Thanks @mark-i-m, that was a huge help!

I have identified the problem. Furthermore, I now realize that this is a much broader performance issue than recursive macro calls (recursion is actually unrelated). For instance, it also affects the following:

macro_rules! idents {
    ($($ident:ident)*) => { };
}

idents!{
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 //     ...  65536 copies of "a" in total ...
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
 a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
}

My current understanding of the issue is as follows: every token matched against the repetition is recorded with `push_match`, which calls `Rc::make_mut` on the vector of accumulated matches.

The reason that's a problem: `eof_items` holds a clone of the same matcher position as `bb_item`, so those `Rc`s are shared at exactly the moment we push, and `Rc::make_mut` deep-clones everything matched so far on every single push, making the work quadratic in the input length.

With a small hack to drop eof_items before modifying bb_item, @limira's code sample now only takes 10 or so seconds to eat all of my memory and send my hard drive thrashing.

...hopefully, some working code compiles faster as well!
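As a toy model of the pathology (invented stand-ins, not the actual parser code): keeping a second `Rc` owner alive while pushing forces `Rc::make_mut` to copy the whole vector on every push, so n pushes cost O(n^2).

```rust
use std::rc::Rc;

fn main() {
    let mut matches: Rc<Vec<u32>> = Rc::new(Vec::new());
    for i in 0..10_000 {
        // Stand-in for the clone held by `eof_items`, alive across the push.
        let _second_owner = Rc::clone(&matches);

        // Shared Rc, so `make_mut` copies all `i` existing elements first:
        // O(n^2) work overall, just like the macro parser.
        Rc::make_mut(&mut matches).push(i);
    }
    assert_eq!(matches.len(), 10_000);
}
```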

limira commented 6 years ago

This is both good and bad news! The bad news is that if I do something like this in the future, my system (Fedora) will become completely unresponsive (I really miss Windows when it comes to this situation). Maybe it's time to learn how to use the OOM killer!

When memory usage increases drastically, maybe we should assume that something is incorrect (and then stop and report an error)?

ExpHP commented 6 years ago

@limira I personally use the limits built into linux. People call me crazy for doing this, but having a guard rail is nice. (so thankfully, I was joking when I said the code snippet sent my hard drive thrashing!)

I don't know if this is the smartest way to do it, but:

/etc/security/limits.conf

    lampam           hard    as              8000000

When I need to temporarily override it, I switch to root to issue a ulimit -v command.

$ ulimit -v
8000000
$ sudo su
# ulimit -v unlimited
# su lampam   # su again to create a nested session; 'exit' would undo the new limit
$ ulimit -v
unlimited

The reason this saves your system from locking up is that it limits how much can be moved to swap (and in my (limited) experience, heavy swap usage is usually what is responsible for the disk thrashing and the apparent system lockup when a single application uses too much memory on Linux).

mark-i-m commented 6 years ago

Maybe it's time to learn how to use OOM Killer!

The OOM killer is a feature of the Linux kernel, not the compiler. When the kernel can't allocate any more physical memory, it begins swapping (which is the reason your system becomes largely unresponsive; Ctrl-C, or killing the rustc process some other way, would help). If the kernel also runs out of swap space, it chooses some process to "OOM kill", usually the one with the most memory usage.

I think so far the policy of rustc has been to allow the compiler's memory usage to be unbounded, though there are some efforts to detect infinite loops in const fn evaluation.

limira commented 6 years ago

@ExpHP: Thanks for the tip! I will try it. @mark-i-m: Thanks for your info!

As in this comment (above), my buggy macro there_is_a_bug! suffers exponential blowup (it triples in size on each call). Let's talk about the smallest exponential blowup, where it merely doubles in size. In a recursive macro, each invocation consumes c (constant) tokens but doubles the number of tokens it passes to the next call. If the initial number of tokens involved is just n, then after about 32 recursive calls the number of tokens that must be processed is roughly n * 2^32 (minus the tokens consumed along the way). This is an unreasonable number compared to the initial count, n. Therefore, I think the solution can be very simple:

  1. Define a constant (like recursion_limit), say max_macro_grow_factor = 1024.
  2. Record the initial number of tokens, n.
  3. At each recursive call, check the new number of tokens, m.
  4. If m > n * max_macro_grow_factor, then
    • stop the macro-expansion process
    • report an error to the user
    • suggest they check the rules involved
    • suggest they increase max_macro_grow_factor to 2048 if they are sure there is no problem with their macro.

I think this is better than just letting the compiler blindly run into the exponential blowup (a sketch of the check is below).
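A sketch of the check I have in mind (all names invented; nothing like this exists in rustc today):

```rust
/// Invented constant, analogous to `recursion_limit`.
const MAX_MACRO_GROW_FACTOR: usize = 1024;

/// Hypothetical guard recording the initial token count of an expansion.
struct GrowthGuard {
    initial_tokens: usize,
}

impl GrowthGuard {
    fn new(initial_tokens: usize) -> Self {
        GrowthGuard { initial_tokens }
    }

    /// Called on each recursive call with the current token count `m`.
    fn check(&self, m: usize) -> Result<(), String> {
        if m > self.initial_tokens.saturating_mul(MAX_MACRO_GROW_FACTOR) {
            Err(format!(
                "macro input grew from {} to {} tokens; check the rules involved, \
                 or raise the limit if the growth is intentional",
                self.initial_tokens, m
            ))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let guard = GrowthGuard::new(4);
    assert!(guard.check(8).is_ok());      // doubling once is fine
    assert!(guard.check(5_000).is_err()); // past 1024x the input: stop and report
}
```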

Ixrec commented 6 years ago

Food for the bikeshed: The current recursion_limit seems like a limit on the depth of the macro call stack, while this new suggestion seems like a limit on the breadth of the macro call stack (where "breadth" is not number of calls, but total number of tokens in those calls).

limira commented 6 years ago

A mistake like the one I made here is rare, but it could happen to anyone. Before the fix by ExpHP, I waited more than 10 minutes (on my system) while cargo build just kept running. After the fix, ExpHP built my code and it ran out of memory in about 10 seconds (on his system). In both cases, there is no message from the compiler. With max_macro_grow_factor = 1024, and assuming the macro simply doubles its size each time, the compiler would stop after only about 10 calls, and then trace_macros!(true) could jump in to give some help.

ExpHP commented 6 years ago

Note that, even without the performance bug my PR fixes, there are still other macros that experience exponential blowup:

(do not compile this!)

macro_rules! blowup {
  // (the brackets work around the performance bug)
  ([$($t:tt)+]) => { blowup!{ [$($t:tt)+] }; };
}

fn main() {
  blowup!([a]);
}

I started a thread on internals for discussion on possible ways the compiler could help with such rogue macros:

https://internals.rust-lang.org/t/thought-saving-users-from-themselves-in-macros-that-explode/7907

limira commented 6 years ago

We definitely need help from the compiler in these situations! Thanks for the push!