Match expressions use O(n) stack space with n branches in debug mode

evanw commented 8 years ago

I ran into this problem while working on a parser (https://github.com/evanw/esbuild/tree/rust). Here's a reduced test case: https://gist.github.com/evanw/06e074a1d6d5c21e8d32e2c26de07714. It contains two recursive functions, small and large, that each contain a match expression. Every call prints out the amount of stack space used.

In debug:

small:
stack space: 0.3kb
stack space: 0.7kb
stack space: 1.0kb
stack space: 1.4kb
stack space: 1.7kb
stack space: 2.1kb
stack space: 2.4kb
stack space: 2.8kb
stack space: 3.1kb
stack space: 3.4kb
stack space: 3.8kb
large:
stack space: 0.6kb
stack space: 1.3kb
stack space: 1.9kb
stack space: 2.6kb
stack space: 3.2kb
stack space: 3.8kb
stack space: 4.5kb
stack space: 5.1kb
stack space: 5.8kb
stack space: 6.4kb
stack space: 7.0kb

In release:

small:
stack space: 0.0kb
stack space: 0.1kb
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.5kb
stack space: 0.6kb
stack space: 0.7kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.0kb
stack space: 1.1kb
large:
stack space: 0.0kb
stack space: 0.1kb
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.5kb
stack space: 0.6kb
stack space: 0.7kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.0kb
stack space: 1.1kb

I would expect the amount of stack space used by a match expression to be proportional to the stack space of the largest branch, not to the total stack space of all branches. The problem isn't too bad here but it causes my actual parser to use huge amounts of stack space and to crash with a stack overflow when parsing virtually all normal-sized inputs.

Aatch commented 8 years ago

I have investigated this somewhat and have found that the stack usage is from register spilling. For some reason LLVM is spilling the registers to a separate location in each branch around the overflow check. I have no idea why it might be doing this though.

Note that without the println! the IR we produce only creates stack slots for the three arguments.

Aatch commented 8 years ago

/cc @rust-lang/compiler

evanw commented 8 years ago

Oh interesting, yeah maybe it only happens when there are early exits inside a match expression. Removing my heavy use of try! inside the match expressions in my actual parser makes my stack size problem completely disappear, but then of course I can't recover from parse errors.

nikomatsakis commented 8 years ago

I would expect that this may be due to the specifics of the lifetimes that we emit for the bindings.

nikomatsakis commented 8 years ago

Oh, never mind, just saw @aatch's comment.

dotdash commented 8 years ago

@Aatch seems like the fast register allocator spills all live registers at the end of each basic block.

nox commented 6 years ago

I suspect this may have far reaching implications making code worse all over the place. For example Stylo is full of very large match expressions and I wonder if that limitation is making it unnecessarily bloated. I may be completely wrong though, given @Aatch's comment about println!.

Cc @rust-lang/wg-codegen

matklad commented 3 years ago

Triage (1.44.1)

We definitely hit this in debug in rust-analyzer. Our code for expression lowering is a single giant recursive match, and it uses 20k of stack space per recursion level in debug if all branches are there. If I comment out all the branches which are not used in my specific test, stack space usage goes down dramatically.

In release, I think I am seeing stack space proportional to the max branch, as commenting branches doesn't make that huge a difference with --release. I think this is true about the original report as well? With --release, both small and large use the same amount of stack space.

See https://github.com/rust-analyzer/rust-analyzer/pull/5316/commits/12d52a726a1e6ca0f6b08343d13d0f6b50d94d8c for a real-world example of this.

oli-obk commented 3 years ago

Maybe @rust-lang/wg-mir-opt can do something to clean up the match logic to make it easier for llvm to figure out the register spilling correctly.

dotdash commented 3 years ago

The problem here is having live values across BB boundaries, because the register allocator in debug mode simply spills and reloads everything, even for unconditional branches.

Silly example:

define internal i8 @testcase(i8 %0) {
  br label %bb2

bb2:
  ret i8 %0
}

becomes:

testcase:                               # @testcase
  .cfi_startproc
# %bb.0:
                                          # kill: def $dil killed $dil killed $edi
  movb>-%dil, -1(%rsp)          # 1-byte Spill
  jmp>.LBB15_1
.LBB15_1:                               # %bb2
  movb>--1(%rsp), %al           # 1-byte Reload
  retq

And in this example, it's not so much the match itself, but the overflow check that causes values that are live across BB boundaries. Compiling with -Cdebug-assertions=no gives the same stack usage for small and large.

small:
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.6kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.1kb
stack space: 1.3kb
stack space: 1.5kb
stack space: 1.7kb
stack space: 1.9kb
stack space: 2.1kb
large:
stack space: 0.2kb
stack space: 0.4kb
stack space: 0.6kb
stack space: 0.8kb
stack space: 0.9kb
stack space: 1.1kb
stack space: 1.3kb
stack space: 1.5kb
stack space: 1.7kb
stack space: 1.9kb
stack space: 2.1kb

Each overflow check causes two spill/reload pairs. One for token (1 byte) and one for the result of the subtraction. Which, for alignment reasons, adds up to 8 bytes of stack usage each. I'm not sure there's much we can do there in terms of MIR construction, but I'd love to be proven wrong here :-)

Also, a good bit of the stack usage is actually used by the println! call. Without the output, small uses 136 bytes of stack with debug assertions, and large uses 440 bytes. Without debug assertions, both use 40 bytes of stack.

In the general case the difference between debug and release mode, can probably be explained by the fact that in release mode, not only do we get a better register allocator, but we also use lifetime intrinsics in LLVM, which allow stack allocated values that are used in only one arm to share space with values only used in other arms. The latter would explain why the observed stack usage in the rust analyzer example goes from Sum(arms) to Max(arms). Short of doing some form of stack coloring of our own, I don't see a way to improve this in terms of MIR generation either.

oli-obk commented 3 years ago

While we can move the count - 1 computation out of the match arms or create a mir optimization that can figure out that the overflow check is unnecessary, because we already checked for count > 0, this won't help in general. E.g. a developer replacing all the count - 1 with count.wrapping_sub(1) just places a function call where there was an overflow check, giving us the same additional basic block. So assuming we want to handle, yea I agree that we can't do anything with a mir-optimization beyond a bunch of not too-high-hanging fruit.

Kampfkarren commented 3 years ago

This hit full-moon, where users are getting stack overflows for standard input. I am not performing one long match, but several in a row; if this is not the same bug, let me know.

https://users.rust-lang.org/t/stack-overflow-on-debug-mode-but-not-release-cant-figure-out-solution-without-increasing-stack-allocation-for-release/59869

rust-lang / rust

Match expressions use O(n) stack space with n branches in debug mode #34283