limira opened this issue 6 years ago
there_is_a_bug! { $($tail:tt)* } will triple the size of $tail on each invocation; you might be hitting exponential blowup here, which the recursion limit can't catch.
... I think I could only partially understand what you said! My impression is that macro expansion goes depth-first, so I think it should hit the recursion limit very soon.
I tried with trace_macros!(true); but cargo build just keeps running without any output. In the actual code, which is much bigger than there_is_a_bug!, it is really hard to find what causes the problem because the compiler is unable to help in this case!
So, expansion does go depth-first. What's blowing up is the argument list:

```rust
there_is_a_bug! { }
// expands to
there_is_a_bug! { :tt }
// expands to
there_is_a_bug! { : :tt tt :tt }
// expands to
there_is_a_bug! { : :tt : :tt tt :tt tt :tt : :tt tt :tt }
// ...
```
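For context, the tripling comes from writing matcher syntax in the transcriber: on the right-hand side of a rule, $tail:tt is not a fragment specifier but the metavariable $tail followed by the literal tokens `:` and `tt`, so each captured token becomes three tokens in the next call. A minimal sketch of a rule with this shape (the full original macro is not quoted in this thread, so this reduction is an assumption; the real macro evidently also seeds some extra tokens, given the first expansion step above):

```rust
macro_rules! there_is_a_bug {
    // In the transcriber below, `$tail:tt` emits the captured token
    // followed by the literal tokens `:` and `tt` -- fragment specifiers
    // only have meaning in the matcher, not in the transcriber.
    ($($tail:tt)*) => {
        there_is_a_bug! { $($tail:tt)* }
    };
}
```

The token count grows as 3^k while the recursion depth grows only linearly, which is why the recursion limit never fires before memory runs out.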
In order to emit a recursion error here without first blowing the heap, I guess it would have to produce the inner token stream lazily. I'm not sure how much this could impact performance?
I tried with trace_macros!(true); but cargo build just keeps running without any output.
Arrhhh! :) The old trace_macros circa 1.0 could have helped here (the one which simply dumped the invocations directly to stdout, as opposed to the new one which goes through the diagnostic APIs).
Oh, thank you very much! What still remains a question for me is: why does the memory usage increase so slowly!? When running cargo build on the minimal code above, I open System Monitor and watch rustc. I expect that its memory usage must increase drastically, but contrary to my expectation, it increases... slowly. But I am glad that it does not eat up all my memory, because when my system (Fedora 27) is out of memory I don't know what to do except cutting its power.
What still remains a question for me is: why does the memory usage increase so slowly!?
This is an interesting question, so I tried taking a look into it using perf. Unfortunately there isn't much I can say with any confidence!
By the way, when I mentioned "the old trace_macros," this is actually a thing you can do:

```
rustup toolchain install 1.5.0
cargo +1.5.0 build
```
If your crate begins with

```rust
#![feature(trace_macros)]
trace_macros!(true);
```

then it will do the tracing to stdout, even though #![feature] isn't supposed to work on stable compilers. (What happens is that macro expansion occurs before the compiler checks whether you are actually allowed to use #![feature] flags!)
One last thing: here's a flamegraph generated from a stage 1 debug compiler. This is a non-optimized compiler, since the optimized debug compiler was producing call stacks that provided zero context for any of the functions of interest (the second item in the call stack was <unknown> at some raw address).
Flamegraph: perf.svg.gz
It suggests the following:

- push_match is frequently being called in places where it must perform an O(n) clone (34661 samples for Vec::clone) just to push a single item onto a vec (46 samples for Vec::push) — see the sketch below.
- macro_parser::parse very frequently drops an ArrayVec<MatcherPos>, which requires several layers of deinitialization code to decrement refcounts in NamedMatches.

What I do not know, however, is whether these are performance bugs that can be fixed, or if they are concessions that were made in favor of optimizing hotter codepaths.
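To make the first point concrete: pushing one element onto an Rc<Vec<_>> whose reference count is greater than one forces a full copy of the vector first. A standalone demonstration of that cost (simplified types; this is not the actual rustc code):

```rust
use std::rc::Rc;

fn main() {
    let mut matches: Rc<Vec<u32>> = Rc::new(vec![1, 2, 3]);
    let other_owner = Rc::clone(&matches); // refcount is now 2

    // `Rc::make_mut` must clone the entire Vec (O(n)) before handing out
    // a mutable reference, because another owner still exists.
    Rc::make_mut(&mut matches).push(4);

    assert_eq!(&**matches, &[1, 2, 3, 4]);
    assert_eq!(&**other_owner, &[1, 2, 3]); // the old owner is untouched
}
```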
By the way, when I mentioned "the old trace_macros," this is actually a thing you can do
Thanks for the info! Maybe the current trace_macros!(true); needs to be improved to help cases like this, without having to go back in time.
I'm going to create a separate issue for the lack of trace_macros! output, so that this issue can be about the performance of this macro's expansion (which I think ought to hit some sort of error more quickly).
(I think this should be tagged with I-slow or something similar; can anyone add that?)
This seems like a perfect fit for immutable data structures: only cloning if something is actually modified, otherwise just keep adding stuff to the end.
Sounds like a good idea, at least from a textbook perspective; though I wonder about cache locality, and whether it really is necessary to have all of these intermediate values with one less item hanging around in the first place. The multiple levels of recursively-called drops in the flamegraph indicate to me that refcounts drop to zero on a regular basis.
(that said, I myself don't really understand what the method is doing or why it is written this way)
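For concreteness, the kind of structure being suggested looks something like the following hand-rolled sketch (a real implementation would more likely reach for a persistent-vector crate):

```rust
use std::rc::Rc;

// A minimal persistent list: "pushing" allocates one new head node that
// shares its entire tail with every previous version, so nothing is
// cloned until (unless) a shared node must actually be modified.
enum List<T> {
    Nil,
    Cons(T, Rc<List<T>>),
}

fn push<T>(list: &Rc<List<T>>, item: T) -> Rc<List<T>> {
    Rc::new(List::Cons(item, Rc::clone(list)))
}

fn main() {
    let empty: Rc<List<u32>> = Rc::new(List::Nil);
    let one = push(&empty, 1);
    let two = push(&one, 2);  // shares `one` (and `empty`) structurally
    let fork = push(&one, 9); // a second branch off `one`; still no cloning
    let _ = (two, fork);
}
```

Whether this would actually win here is exactly the cache-locality question raised above.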
I still have a perverse fascination with why there's so much wasted sharing here, but still don't really understand what matches contains, even with all of the nice documentation there. (I'd like to do some "println reverse engineering" to help ground my interpretations, but don't currently have a machine that isn't busy running some heavy computations!)
cc @mark-i-m who added much of this documentation recently and probably has a much better grasp of what the code is doing. Any thoughts?
@ExpHP I assume you mean this?
This field contains token trees that match against metavariables. Consider the following example:
```rust
// Macro definition
macro_rules! foo {
    ($($id:ident),*) => {}
}

// Invocation
let w;
let x;
let y;
let z;
foo!(w, x, y, z);
```
In this example, $id is a metavariable. It will match against four different token trees: w, x, y, and z, so the contents of matches would be those token trees.
Does that make sense?
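Concretely, a rough picture of what matches would hold here (simplified; NamedMatch does have MatchedSeq and MatchedNonterminal variants, but the details shown are a guess, not rustc's exact representation):

```rust
// Rough shape of `matches` after parsing `foo!(w, x, y, z)`: one entry
// for the single metavariable `$id`, holding the sequence of the four
// matched token trees.
//
// matches = [
//     MatchedSeq([
//         MatchedNonterminal(w),
//         MatchedNonterminal(x),
//         MatchedNonterminal(y),
//         MatchedNonterminal(z),
//     ]),
// ]
```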
Regarding the inefficiencies of the code, it's been a while since I looked at it, so I'm not sure off the top of my head if/why such things might be necessary. I don't think the original authors made too much effort to make it efficient. Recently, @nnethercote went through and reduced the number of allocations in #50855.
Thanks for responding!
I think my greater confusion is with respect to when the Rc<Vec<_>>s are ever meaningfully shared. Are we sharing them between multiple branching candidate paths? Are we building a DAG? (This is what seems to be suggested by the Rc<Vec<NamedMatch>> recursively contained within NamedMatch.)
The comments suggest[^1] there is sharing between the subparser state in a repetition and the top level state, yet it seems that everything in match_lo..=match_hi
is guaranteed to be modified (and thus replicated if the refcount exceeds 1).
[^1] I say this due to the phrase "the matches field is shared (hence the Rc) among all 'nested' matchers", though the next paragraph seems to contradict it: "Also, note that while we are descending into a sequence, matchers are given their own matches vector."
Ah, I see what you mean.
I'm not 100% sure, but I think there are two different reasons why that exists:
1. MatcherPoses are often cloned to produce the next position. We want to avoid cloning lots of Vecs and TokenTrees. Those clones appear in these two locations:
2. NamedMatch::MatchedSeq can contain a collection of matches for a single metavariable that are shared by at least two possible matcher positions.

In particular, consider the following example. We want to match foo foo foo against $($x:ident)+. After processing the first foo, the matcher position looks like this (where o is the "dot"): $($x:ident o)+. There are two possible next matcher positions:

- $(o $x:ident)+
- $($x:ident)+ o

Both of these new matcher positions are added to cur_items. In the first case, we just reuse item after updating it:

and

In the second case, we clone the up matcher (the context right before we entered the sequence), modify it so that the "dot" is after the sequence, and share the new matches we found with the new position. That occurs here:
Does that make sense?
Thanks @mark-i-m, that was a huge help!
I have identified the problem. Furthermore, I now realize that this is a much broader performance issue than recursive macro calls (which is actually unrelated). For instance, it also affects the following:
```rust
macro_rules! idents {
    ($($ident:ident)*) => { };
}

idents!{
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    // ... 65536 copies of "a" in total ...
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
    a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
}
```
My current understanding of the issue is as follows: any repetition will incur O(n^2) work as it matches n copies of the repetition, if it meets both of the following conditions:

...in other words, most (though not all) tt munchers!
The reason:

1. When inner_parse_loop reaches the end of the repetition (Graphic: $($t:tt o)*), it yields two items: one in eof_items (Graphic: $($t:tt)* o) and one in bb_items (Graphic: $(o $t:tt)*).
2. Back in fn parse(), it checks for EOF. Assuming this fails, eof_items is no longer needed (but it is not yet dropped!).
3. A new match is then pushed onto the bb_item:
https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L743
However, the Rc<Vec<NamedMatch>> is shared with eof_item, so this implicitly clones the Vec.
4. eof_items is finally dropped at the end of the loop { }:
https://github.com/rust-lang/rust/blob/afaa40646542ca1e8fadb7e1d34197237e0fcd19/src/libsyntax/ext/tt/macro_parser.rs#L756

With a small hack to drop eof_items before modifying bb_item (sketched below), @limira's code sample now only takes 10 or so seconds to eat all of my memory and send my hard drive thrashing.
...hopefully, some working code compiles faster as well!
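The "small hack" mentioned above would have roughly the following shape (an illustrative sketch with stand-in types, not the actual patch): release the other owner of the shared matches before mutating, so that Rc::make_mut sees a reference count of 1 and mutates in place instead of cloning.

```rust
use std::rc::Rc;

// Stand-in for the shared `Rc<Vec<NamedMatch>>` (hypothetical simplified type).
type Matches = Rc<Vec<u32>>;

fn advance(bb_matches: &mut Matches, eof_items: Vec<Matches>, new_match: u32) {
    // Drop the stale EOF candidates first, releasing their handle on the
    // shared Vec...
    drop(eof_items);
    // ...so this push mutates in place instead of performing an O(n) clone.
    Rc::make_mut(bb_matches).push(new_match);
}

fn main() {
    let shared: Matches = Rc::new(vec![1, 2]);
    let mut bb = Rc::clone(&shared);
    // `shared` plays the role of the matches still held by `eof_items`.
    advance(&mut bb, vec![shared], 3);
    assert_eq!(&**bb, &[1, 2, 3]);
}
```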
This is both good and bad news! The bad news is that if I do something like this in the future, my system (Fedora) will completely stop responding (I really miss Windows when it comes to this situation). Maybe it's time to learn how to use the OOM Killer!

When the memory usage increases drastically, maybe we should assume that something is incorrect (then stop and report an error)?
@limira I personally use the limits built into Linux. People call me crazy for doing this, but having a guard rail is nice. (So thankfully, I was joking when I said the code snippet sent my hard drive thrashing!)
I don't know if this is the smartest way to do it, but:
In /etc/security/limits.conf:

```
# <domain>  <type>  <item>  <value>   ('as' caps the address space, in KiB)
lampam      hard    as      8000000
```
When I need to temporarily override it, I switch to root to issue a ulimit -v
command.
```
$ ulimit -v
8000000
$ sudo su
# ulimit -v unlimited
# su lampam   # su again to create a nested session; 'exit' would undo the new limit
$ ulimit -v
unlimited
```
The reason this saves your system from locking up is that it limits how much memory can be moved to swap (and in my (limited) experience it is heavy use of swap that is usually responsible for the disk thrashing and the apparent system lockup when too much memory is used by a single application on Linux).
Maybe it's time to learn how to use OOM Killer!
OOM killer is a feature of the Linux kernel, not the compiler. When the kernel can't allocate any more physical memory, it begins swapping, which is the reason your system becomes largely unresponsive (Ctrl-C, or killing the rustc process some other way, would help). If the kernel also runs out of swap space, it chooses some process to "OOM kill", usually the one with the most memory usage.
I think so far the policy of rustc has been to allow the compiler's memory usage to be unbounded, though there are some efforts to detect infinite loops in const fn evaluation.
@ExpHP: Thanks for the tip! I will try it. @mark-i-m: Thanks for your info!
As in this comment (above), my buggy macro there_is_a_bug! suffers exponential blowup (it triples its size on each call). Let's consider the smallest exponential blowup, where a macro merely doubles its size. In a recursive macro, each invocation of the macro consumes c (constant) tokens, but doubles the number of tokens that it passes to the next call. If the initial number of tokens involved is just n, then after (about?) 32 recursive calls, the number of tokens that must be processed is on the order of n * 2^32. This is an unreasonable number compared to the initial number of tokens, n. Therefore, I think the solution can be very simple:

- Add a new configurable limit (similar to recursion_limit); maybe its name is max_macro_grow_factor = 1024.
- If the first invocation involves n tokens and a later invocation involves m tokens, with m > n * max_macro_grow_factor, then the compiler stops and reports an error.
- Users can raise the limit, e.g. max_macro_grow_factor = 2048, if they are sure there is no problem with their macro.

I think this is better than just letting the compiler blindly run into the exponential blowup.
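A sketch of what that check might look like (entirely hypothetical; max_macro_grow_factor is the name proposed above, and the surrounding plumbing is invented for illustration):

```rust
// Hypothetical growth check, per the proposal above: compare the current
// invocation's token count against the first invocation's count, scaled
// by the allowed growth factor.
const MAX_MACRO_GROW_FACTOR: usize = 1024;

fn check_macro_growth(initial_tokens: usize, current_tokens: usize) -> Result<(), String> {
    if current_tokens > initial_tokens.saturating_mul(MAX_MACRO_GROW_FACTOR) {
        return Err(format!(
            "macro expanded from {initial_tokens} to {current_tokens} tokens; \
             this looks like runaway growth (consider raising max_macro_grow_factor)"
        ));
    }
    Ok(())
}
```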
Food for the bikeshed: the current recursion_limit seems like a limit on the depth of the macro call stack, while this new suggestion seems like a limit on the breadth of the macro call stack (where "breadth" is not the number of calls, but the total number of tokens in those calls).
A mistake like the one I made here is rare, but it could happen to anyone. Before the fix by ExpHP, I waited for more than 10 minutes (on my system) while cargo build just kept running. After the fix, ExpHP built my code and it ran out of memory in about 10 seconds (on his system). In both cases, there is no message from the compiler. With max_macro_grow_factor = 1024, and assuming that the macro simply doubles its size, the compiler would stop after just about 10 calls, and then trace_macros!(true) could jump in to give some help.
Note that, even prior to my PR, there are still other macros that experience exponential blowup:
(do not compile this!)

```rust
macro_rules! blowup {
    // (the brackets work around the performance bug)
    ([$($t:tt)+]) => { blowup!{[$($t:tt)+]}; };
}

fn main() {
    blowup!([a]);
}
```
I started a thread on internals for discussion on possible ways the compiler could help with such rogue macros:
https://internals.rust-lang.org/t/thought-saving-users-from-themselves-in-macros-that-explode/7907
We definitely need help from the compiler in these situations! Thanks for the push!
I know what my mistake is, ~~but I think there is a bug in Rust!~~ Edit: We just need better help from the compiler to easily spot mistakes like this.
Edit: the problem seems to involve only the first rule. So the new minimal buggy code is:

cargo build will run forever! (Well, I just waited for it for about ten minutes!)