Replace stack overflow checking with stack probes

Zoxc commented 10 years ago

We currently abuse LLVM segmented stack support to check for stack overflows. It would be more efficient to use guard pages to detect these. We already have guard pages on all the stacks. However, to ensure that the code doesn't skip the guard pages, we need to insert stack probes. LLVM can already generate code for that on x86 and ARM for Windows. We'd just need to expose that as an option on other platforms.

It would be nice if support for stack probes could be added to MIPS in LLVM too, so we can get rid of the runtime support for the stack overflow checking.

Using stack probes is also easy and desirable to support in freestanding mode.

alexcrichton commented 10 years ago

Here's a comment in the past about this. Sadly I don't think "just use a guard page" will cut it for the reasons outlined in that comment. That being said, I'd love to stop using segmented stacks!

Zoxc commented 10 years ago

I do think printing out an error message is a bad thing. On Windows it means interfering with the dialog which informs the user about an error and allows developers to debug the application. On Linux it interferes with debugging. If such a message is desired for some reason, we can support it, but the POSIX solution probably isn't very pretty.

I was wondering what the impact of this would be, so I commented out the code generating the split stack attributes and did some simple benchmarks:

2.35% reduced size of librustc
4.11% less time compiling libcore
19.3% less time running shootout-fibo

thestinger commented 10 years ago

Printing an error message on stack overflow would be trivial if there weren't green threads. It's as simple as installing a signal handler and printing out an error if the address is located in the guard page range. It's yet another case where green threads make the language significantly worse than C++. I don't think there's a sane way to do it without horrific spin locks in today's Rust unless printing a special error message is not a requirement.
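The address check described here can be sketched in Rust as below. This is only the classification step a signal handler would perform, not a handler itself; `StackInfo` and `is_stack_overflow` are hypothetical names made up for illustration, and how the guard range is discovered is left out.

```rust
use std::ops::Range;

/// Hypothetical description of one thread's stack. The field name is
/// illustrative and not taken from the Rust runtime.
struct StackInfo {
    /// Address range covered by the guard page(s), lowest address first.
    guard: Range<usize>,
}

/// Returns true if the faulting address lies inside the guard range.
/// A SIGSEGV handler could make this check before deciding to report
/// a stack overflow instead of a generic segmentation fault.
fn is_stack_overflow(fault_addr: usize, stack: &StackInfo) -> bool {
    stack.guard.contains(&fault_addr)
}
```

With a guard range of `0x1000..0x2000`, a fault at `0x1800` would be classified as a stack overflow, while a fault at `0x2000` (the first byte of the usable stack) would not.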

thestinger commented 10 years ago

Anyway... the segmented stack support only catches overflow when it happens to occur on the Rust side rather than in C code that's being called. It doesn't actually work. For example, an infinitely recursive function may be allocating memory and there's a good chance it will be jemalloc triggering the overflow.

Zoxc commented 10 years ago

We can give error messages for libgreen by having it inform libnative about the active guard page. Foreign code might skip guard pages on non-Windows platforms, so we won't catch all stack overflows.

thestinger commented 10 years ago

It's the platform's problem if it doesn't build with -fstack-check. Stack frames that large aren't common in C code anyway.

pcwalton commented 10 years ago

@thestinger Knock off the general comments about how Rust is a better or worse language than C++ please.

pcwalton commented 10 years ago

Anyway, I totally agree with the thrust of using guard pages here, and would like us to move away from segmented stacks as soon as possible. I personally don't think good error messages in libgreen should block us. The fibo benchmark is particularly compelling.

I don't see why libgreen is relevant anyway; if there's a guard page either way, just have libgreen handle it in the same way as libnative. It should be easy to run enough of Rust from the signal handler to print out an error message before dying.

huonw commented 10 years ago

There was some discussion on the LLVM mailing list about this:

Zoxc commented 10 years ago

That may have been me.

thestinger commented 10 years ago

http://togototo.wordpress.com/2014/08/05/fibonacci-numbers-on-the-galaxy-s3-arm-benchmarks-of-rust-ocaml-haskell-go-racket-lua-c-and-java/

Zoxc commented 10 years ago

I do have an implementation of the LLVM part for x86 and perhaps ARM (not sure how comprehensive that support is). https://github.com/Zoxc/llvm/compare/llvm-mirror:master...stprobe

I don't know enough about MIPS to implement that in LLVM or enough about MIPS and ARM to implement the __probestack support function.

huonw commented 10 years ago

http://togototo.wordpress.com/2014/08/05/fibonacci-numbers-on-the-galaxy-s3-arm-benchmarks-of-rust-ocaml-haskell-go-racket-lua-c-and-java/

That fibonacci implementation is tail recursive, and optimises to a loop:

else-block.i.i.i:                                 ; preds = %else-block.i.i.i.preheader, %else-block.i.i.i
  %.tr710.i.i.i = phi i64 [ %107, %else-block.i.i.i ], [ %104, %else-block.i.i.i.preheader ]
  %.tr69.i.i.i = phi i64 [ %106, %else-block.i.i.i ], [ 1, %else-block.i.i.i.preheader ]
  %.tr8.i.i.i = phi i64 [ %.tr69.i.i.i, %else-block.i.i.i ], [ 0, %else-block.i.i.i.preheader ]
  %106 = add i64 %.tr8.i.i.i, %.tr69.i.i.i
  %107 = add i64 %.tr710.i.i.i, -1
  %108 = icmp eq i64 %107, 0
  br i1 %108, label %_ZN3fib20hcf5cee5c8487747eOaaE.exit.i.loopexit, label %else-block.i.i.i

thestinger commented 10 years ago

@huonw: It's probably just the difference in compilers then, never mind. Rust is actually a bit faster than the C code on x86_64 when using 32-bit integers.

huonw commented 10 years ago

On x86-64 it actually seems to be using a 64-bit int in Rust vs. the 32-bit one in C. Changing to i32 casts Rust in a much better light:

75699
LANGUAGE Rust 2167
75699
LANGUAGE C 2220
75699
LANGUAGE C 2271

(Anyway, this is off-topic for this bug, although @pcwalton may be interested in it.)

Zoxc commented 10 years ago

Would not having safe stacks on MIPS block landing this?

Do we definitely want a signal handler/exception handler to print out that a stack overflow happened?

geofft commented 9 years ago

I'm trying to get a sense of where this is / how to bring this forward and I'm confused about a few things:

What's the story here with MIPS? Am I reading correctly that LLVM can already do this for most Linux architectures but not MIPS? (What makes MIPS different from any other CPU with a stack pointer?)
What runtime support is needed? I believe that GCC's -fstack-check=generic works by just inserting an inline loop to probe the stack a page at a time -- is having a separate __probestack useful? It seems like avoiding the addition to compiler-rt entirely would help this patch get landed and would also simplify the necessary change to Rust itself. (On Windows, where __chkstk already exists, LLVM can continue to call into that.)
There's a mention on the LLVM mailing list about people who might want __probestack not to use signals, but it doesn't, itself, right? It just induces a SIGSEGV. Isn't the answer to that use case to abuse LLVM segmented stack support / __morestack as Rust currently does?
Where should I look for your current LLVM and compiler-rt (if any) patches?
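The "probe the stack a page at a time" idea mentioned above can be modelled as follows. This is a sketch under assumptions, not what GCC or LLVM actually emit: `PAGE_SIZE` is assumed to be 4096, `probe_addresses` is a hypothetical helper, and the function only computes which addresses a prologue would touch rather than touching real memory.

```rust
/// Assumed page size; real targets may differ.
const PAGE_SIZE: usize = 4096;

/// Addresses a prologue would touch while reserving `frame_size` bytes
/// below `sp`, stepping one page at a time (the stack grows down).
/// Assumes `frame_size <= sp` so the subtraction cannot underflow.
fn probe_addresses(sp: usize, frame_size: usize) -> Vec<usize> {
    let mut touched = Vec::new();
    let mut offset = PAGE_SIZE;
    // Touch one address in every full page covered by the frame.
    while offset < frame_size {
        touched.push(sp - offset);
        offset += PAGE_SIZE;
    }
    // Final touch at the new stack pointer itself.
    touched.push(sp - frame_size);
    touched
}
```

Because consecutive touches are never more than one page apart, a guard region of at least one page cannot be stepped over in this model, provided `sp` itself was already in touched memory.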

My personal opinion is that if this can be turned on for the common platforms (at least Linux on x86-32, x86-64, maybe ARM), I'd be a lot more comfortable with #27388. But if there's no Linux support at all right now, then dropping stack checks is a bit more worrisome.

nagisa commented 9 years ago

What's the story here with MIPS? Am I reading correctly that LLVM can already do this for most Linux architectures but not MIPS? (What makes MIPS different from any other CPU with a stack pointer?)

As far as MIPS goes, @Zoxc couldn’t find anybody who knows MIPS assembly to implement probing, hence the question.

AFAIR no, it hasn’t landed yet.

What runtime support is needed?

~~Absolutely none!~~ It depends on the implementation. Reading the LLVM patch, apparently, a __probestack function will have to be defined.

There's a mention on the LLVM mailing list about people who might want __probestack not to use signals, but it doesn't, itself, right? It just induces a SIGSEGV. Isn't the answer to that use case to abuse LLVM segmented stack support / __morestack as Rust currently does?

__morestack and probes are fundamentally functionally different. Different enough that it makes little sense to try to emulate probing with __morestack.

Where should I look for your current LLVM and compiler-rt (if any) patches?

Here and here. Not sure whether this is the most recent patch-set, though.

klutzy commented 9 years ago

Most recent llvm-dev discussion (2015-07-26)

bstrie commented 8 years ago

I may be going out on a limb here, but I'm tagging this with I-unsound since stack overflow is still a theoretical attack vector (see https://github.com/rust-lang/rust/pull/27338 for more discussion on this).

I think we need to refocus this discussion since a lot has changed since the bug was first opened (I'm almost tempted to open an entirely new bug). Specifically, where are we at now for supporting stack probes on all platforms (not just first-tier platforms), what needs to be done to accomplish this, and who has the expertise needed to implement it?

nagisa commented 8 years ago

http://reviews.llvm.org/D12483 seems to be the most recent patch against llvm.

bstrie commented 8 years ago

@nagisa That's still in review, yes? It also seems to be a bit tentative.

Are you implying that the next step to closing this issue is "wait for LLVM to support what we need"? If so, then what would need to be done on our end once that support appears? How involved a change would it be?

nagisa commented 8 years ago

Are you implying that the next step to closing this issue is "wait for LLVM to support what we need"?

The alternative would be implementing and testing out similar support in our fork of LLVM. If we really want this to get fixed faster, then this is certainly an option. I believe we do not quite support external LLVM anyway at the moment.

If so, then what would need to be done on our end once that support appears? How involved a change would it be?

It then just comes down to annotating every function with the probe-stack attribute by default. Then LLVM would add probes for functions that do in fact need them. I believe we already activate some attributes by default, so activating one more shouldn’t be too involved.

bstrie commented 8 years ago

Nominating as this is a soundness bug that has yet to have a priority assigned.

nikomatsakis commented 8 years ago

triage: P-medium

It'd be good to get some clarification from @alexcrichton (or someone) as to the current state of guard pages etc and what the precise risk is here. I tried following up on the links but there was a lot to read!

nagisa commented 8 years ago

Our implementation of guard pages is good (I don’t remember if there’s an implementation for all non-tier-1 platforms) and works well/correctly for both the main thread (implicitly created by the OS) and threads created using the standard APIs (we create the guard page ourselves).

There’s a risk to read/(over-)write data that does not belong to us/isn’t on the stack page (i.e. is outside the stack) only in very specific circumstances:

The function must “allocate” more than one page¹ of stack memory (size of which depends on system configuration); AND
The stack memory region shadowing the guard page(s) must not be written to/read from/executed: stack memory is initialized in a way that’s guaranteed (to my knowledge) to hit the guard page first if there’s at least some part of the guard page that’s not shadowed by uninitialized stack memory; AND
One must guess correctly, on the first try, which exact pages are mapped past the guard page that has just been shadowed by uninitialized memory (ASLR, which is on by default on all T1 platforms, would change the addresses and offsets on the next run).

So, while technically this is a soundness bug, it is hard to imagine ever seeing this being abused in any way. Exposing variable length stack-allocated arrays would make this easier to abuse, but that’s not happening to my knowledge.

¹: Or however many pages guard pages use on a given OS.

Ah, and Windows is not affected since it already has stack probes.

alexcrichton commented 8 years ago

Yes I believe @nagisa is correct on all accounts.

If and when LLVM has support for stack probes on all platforms, it seems like we should enable them!

phil-opp commented 8 years ago

stack memory is initialized in a way that’s guaranteed (to my knowledge) to hit the guard page first if there’s at least some part of the guard page that’s not shadowed by uninitialized stack memory

It seems like this isn't always the case. For example:

fn stack_overflow() {
    let x = [0u8; 999999999];
}

Playpen: http://is.gd/g8ZDkD

From the assembly output it seems like the array initialization starts at the bottom:

...
.Ltmp8:
    subq    $999999888, %rsp        ; subtract 999999888 from the stack pointer
    xorl    %eax, %eax
    movl    %eax, %ecx
    leaq    -1000000000(%rbp), %rdx
    movb    $61, -1(%rbp)
    movq    %rdx, -1000000008(%rbp)
    movq    %rcx, -1000000016(%rbp)
.LBB1_1:
    movq    -1000000016(%rbp), %rax
    movq    -1000000008(%rbp), %rcx
    movb    $0, (%rcx,%rax)         ; write a 0 byte to memory at (rcx, rax)
    addq    $1, %rax                ; increase rax by 1
    cmpq    $999999999, %rax
    movq    %rax, -1000000016(%rbp)
    jb  .LBB1_1
    .loc    1 9 0 prologue_end
...

I ran into this when I added guard pages to the kernel stack in my toy OS. If the array size was big enough, the code would miss the guard page and mess up page tables.
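The failure mode described here can be illustrated with a toy address model. It is a sketch only: `Outcome` and `first_write` are hypothetical names, the addresses are made up, and the model assumes (as in the initialisation loop above) that the first write to a new frame goes to its lowest address.

```rust
use std::ops::Range;

/// Result of the first write into a freshly reserved frame in this toy model.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The write stays inside the mapped stack.
    Ok,
    /// The write lands in the guard page and faults as intended.
    GuardHit,
    /// The write lands below the guard page, in unrelated memory.
    Skipped,
}

/// Classifies the first write to a frame of `frame_size` bytes reserved
/// below `sp`, given the guard page range. Assumes `frame_size <= sp`.
fn first_write(sp: usize, frame_size: usize, guard: &Range<usize>) -> Outcome {
    // Lowest address of the new frame; the stack grows down.
    let addr = sp - frame_size;
    if addr >= guard.end {
        Outcome::Ok
    } else if addr >= guard.start {
        Outcome::GuardHit
    } else {
        Outcome::Skipped
    }
}
```

With `sp = 0x10000` and a guard at `0x1000..0x2000`, a frame of `0xE800` bytes hits the guard, while a frame of `0xF800` bytes lands at `0x800` and skips it entirely, which is the case stack probes are meant to rule out.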

nagisa commented 8 years ago

@phil-opp great observation! We want to initialize the array from beginning to end¹ (as we do here), but the array is laid out on the stack reversed (i.e. the first element is at the head of the stack and the last element is closer to the beginning of the stack).

So… this is way easier to abuse than I initially claimed.

¹: and since this code optimises down to memset@PLT, we can’t really tell in which direction initialisation really happens, anyway.

whitequark commented 8 years ago

What happens on platforms that don't have stack probes? I.e. anything MMU/MPU-less. Right now it seems there wouldn't be any way to reenable stack overflow checking, and while it's possible to write a custom LLVM pass and enable it via rustc -C llvm-args=-load=liboverflowcheck.so, requiring users of something like http://zinc.rs to check out rustc's LLVM, build it, take care to keep it in sync with upstream and finally build a pass seems extremely hostile.

bharrisau commented 8 years ago

The [better] solution for MxU-less baremetal systems is to grow the stack out of the RAM instead of into the heap (i.e. have the stack lower than the heap). It is the sanest solution, but a little trickier to do with ld. (you need to specify the stack size instead of letting it be the space remaining after the heap)

whitequark commented 8 years ago

@bharrisau Sure, but that does not work if you have more than one stack.

alexcrichton commented 8 years ago

@whitequark I would suspect that any "flavorful" platforms would just have stack probes disabled (e.g. it'd be a custom-target-spec option).

whitequark commented 8 years ago

@alexcrichton are you suggesting that Rust will be inherently memory-unsafe even in safe code on every MPU-less platform? That's quite crippling especially because there is no MPU.

whitequark commented 8 years ago

This is not a theoretical concern. On targets with little RAM, the memory layout is quite packed and small stacks directly translate to reduced device cost. Stack overflow checking is a desirable feature, e.g. FreeRTOS has their own implementation. Of course, it's not actually guaranteed to catch all stack overflows; Rust is capable of doing that and there is no excuse not to.

whitequark commented 8 years ago

An ideal solution would be an ability to specify a symbol (LLVM global) holding the current stack limit, with the symbol name being configurable. An RTOS then would update it every time it transitions to a new stack.
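A rough Rust model of this proposal might look like the following. Everything here is hypothetical: `SP_LIMIT`, `Task`, `switch_to` and `check_frame` are illustrative names, the check is written as an ordinary function rather than a compiler-inserted prologue, and the red zone is ignored for simplicity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical global holding the lowest usable stack address of the
/// currently running task (the configurable "stack limit" symbol).
static SP_LIMIT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical task descriptor kept by the RTOS.
struct Task {
    stack_bottom: usize,
}

/// What the scheduler would do whenever it switches to another stack.
fn switch_to(task: &Task) {
    SP_LIMIT.store(task.stack_bottom, Ordering::Relaxed);
}

/// What an inserted prologue would check: reserving `frame_size` bytes
/// below `sp` must not cross the current limit. Returns the new stack
/// pointer on success.
fn check_frame(sp: usize, frame_size: usize) -> Result<usize, &'static str> {
    let limit = SP_LIMIT.load(Ordering::Relaxed);
    match sp.checked_sub(frame_size) {
        Some(new_sp) if new_sp >= limit => Ok(new_sp),
        _ => Err("stack overflow"),
    }
}
```

After `switch_to(&Task { stack_bottom: 0x2000_0000 })`, a call such as `check_frame(0x2000_0400, 0x100)` succeeds, while a frame of `0x800` bytes from the same stack pointer is rejected.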

ranma42 commented 8 years ago

@whitequark in a multi-threading environment the stack (and possibly also the stack limit) should be per-thread, hence the updates and checks you are mentioning should be on a thread-local variable. What you are proposing looks to me like software MMU emulation or explicit allocation (each call tries to allocate a stack frame from the fixed-size stack vector). I am afraid this pattern would make even simple functions significantly more complex, possibly preventing many basic optimisations. Moreover, outside of LLVM the knowledge about how much stack is used is not complete (how many registers are going to be spilled? are allocas optimised away?). I believe that the feature you are requesting should be implemented in LLVM rather than rustc.

whitequark commented 8 years ago

@ranma42 No, I am proposing exactly per-thread checks. It's just that LLVM's lowering of thread-locals is not useful for non-hosted targets, and anyway, nearly always the only reasonable way to implement those is using a regular global variable. In case LLVM's lowering for your target does support useful thread-locals, you can simply supply one.

Yes. This feature can only be implemented as an LLVM pass; probably no changes to rustc itself are wanted or necessary. However, LLVM is an implementation detail. Rust claims to provide memory safety; it is the duty of its compilers to ensure that memory safety can in fact be provided on all platforms. I think that rustc should include such a pass in its fork of LLVM, since it is going to use its own fork for the foreseeable future anyway, specifically due to Rust-specific passes.

whitequark commented 8 years ago

The old hack abusing the split-stack machinery was almost what I suggest here; it, however, embedded platform-specific knowledge to generate code extracting stack limit from a thread-local, and LLVM asserted out on any uncommon platform. If it was extended to read the stack limit from a (plain global or thread-local) variable given as an option rather than hardcoding the offsets for Linux, Windows, etc in the backend, it would work perfectly.

nikomatsakis commented 8 years ago

I agree that it is our responsibility to do stack checking, however we achieve it -- but it also seems clear that we want to configure this per target (iow, we do not want to add read/writes of a stack limit variable to every fn when we can use a guard page, etc).

whitequark commented 8 years ago

@nikomatsakis I agree of course, guard page is the best method when it is available. I am only saying that it shouldn't be the only one.

whitequark commented 8 years ago

Speaking of cost of these checks--it is actually fairly low. The prologue code should look something like this (assuming Cortex-M3):

.syntax unified
.text
prologue:                          @    0
  movw    r0, #:lower16:sp_limit   @ +1 1
  movt    r0, #:upper16:sp_limit   @ +1 2
  ldr     r0, [r0]                 @ +2 4
  add     r0, r0, #n               @ +1 5 (n = stack frame size + red zone size)
  cmp     sp, r0                   @ +1 6
  ble     __overflow               @ +1 7 (common case)

.data
sp_limit:
  .long   0

It will take 7 cycles (~100ns on a 72MHz core), assuming sp_limit lives in SRAM, and will inflate non-leaf functions by 20 bytes. (I used movw/movt to avoid wait states for the case where flash runs at a lower frequency than the core, but if a load from a constant island would be used, the size penalty is 14 bytes plus four bytes per 4k of code). There are no caches, so the delay inflicted by memory access is isolated and predictable.

Since red zone is used, leaf functions with small stack frames do not have to pay at all.
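The figures quoted above can be checked with a little arithmetic. The helpers below are hypothetical and only restate the numbers from the comment (7 cycles, 72 MHz, 20 bytes per non-leaf function for the movw/movt variant); they are not measurements.

```rust
/// Converts a cycle count to nanoseconds at a given core clock in hertz.
fn cycles_to_ns(cycles: u32, clock_hz: f64) -> f64 {
    f64::from(cycles) / clock_hz * 1e9
}

/// Code size added by the movw/movt prologue, using the 20 bytes per
/// non-leaf function quoted in the comment.
fn prologue_bytes(non_leaf_functions: usize) -> usize {
    non_leaf_functions * 20
}
```

For 7 cycles at 72 MHz this gives roughly 97 ns, consistent with the "~100ns" estimate, and a program with 100 non-leaf functions would grow by about 2000 bytes under the same assumptions.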

alexcrichton commented 8 years ago

@whitequark

It seems pretty reasonable to me that flavorful targets could use the morestack-like stack checking instead of guard pages to ensure that we can have stack checking everywhere. It'd likely require some LLVM modifications to be amenable, but shouldn't necessarily be a showstopper either way?

whitequark commented 8 years ago

@alexcrichton Hm, yes. I've looked at these commits again and removal of __morestack/stack_overflow langitem won't really complicate introduction of such stack checking, so now I don't think I have anything in particular to say about this specific PR.

bstrie commented 8 years ago

For those looking for background info on this bug, and considering that this is the top Google hit for "stack probe", here's an actual definition of stack probes (which AFAICT is missing from this thread):

A stack probe is a sequence of code that the compiler inserts into every function call. When initiated, a stack probe reaches benignly into memory by the amount of space that is required to store the function's local variables.

If a function requires more than size bytes of stack space for local variables, its stack probe is initiated. By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space.

https://msdn.microsoft.com/en-us/library/9598wk25.aspx

brson commented 7 years ago

Last time there was a thread on this issue somebody said they were working on it. Does anybody recall who that was, whether there is an updated LLVM fork, or anything on the status of this? This issue comes up far more than activity on this thread would indicate...

Zoxc commented 7 years ago

I heard @whitequark was doing something, but I don't know what. This is my latest LLVM fork

whitequark commented 7 years ago

Finishing this is still on my TODO list.

Zoxc commented 7 years ago

Given that Rust is now used in production, can we merge changes to support this into our LLVM fork? Ping @nagisa @alexcrichton

whitequark commented 7 years ago

@Zoxc Please simply address the upstream concerns.

rust-lang / rust

Replace stack overflow checking with stack probes #16012