Consider removing the need for a "sentinel stack frame" while unwinding

jamesmunns commented 1 year ago

At the moment, it is necessary for probe-run to "run into" a sentinel frame where the link register is recorded as 0xFFFF_FFFF while unwinding (on panic).

https://github.com/rust-embedded/cortex-m-rt/pull/337 was opened to include this in c-m-rt, but it turns out, this caused a pretty serious regression, that we just recently caught.

We're fixing this for cortex-m-rt v0.7.3 (in https://github.com/rust-embedded/cortex-m/pull/463) by pushing another word to the stack, but this means that all targets have 8 bytes of "dead" stack usage, as well as some startup code (and confusing asm) necessary to insert the necessary "full stack frame".

During testing, we noticed that GDB was actually totally happy to not have the sentinel frame at all - only probe-run was upset (it throws a warning that the stack is corrupted).

We're currently looking at whether it would be possible to totally remove the sentinel frame in c-m-rt. Ideally, this would involve probe-run being able to handle the case where it hits the end of the stack without hitting the sentinel frame.

No hard decision has been made to remove the termination frame (and cfi/cfa hints) yet, but there seems to be some desire to. We decided to keep it (and add a second push) for now, to allow us to ship 0.7.3 ASAP, and not break (or cause problems for) probe-run.

(suggested impl ahead, feel free to ignore): I think it should be possible to notice when hitting the stack top symbol, and if it was hit "gracefully", e.g. there is SOME frame right at the top of the stack with no excess data, it can be treated as the terminal frame, and not throw a "stack corrupted" warning.

I'm happy to provide a patched version of cortex-m-rt that would reproduce the currently discussed behavior for testing.

adamgreig commented 1 year ago

I have to apologise for making a liar of @jamesmunns here: in the end I decided to go directly to the startup code we want to end up with, which does not push a stack frame before branching to the main function. This will be in cortex-m-rt 0.7.3 which we'll encourage users to update to quickly to avoid the miscompilation issue (advisory); at that point they'll get a warning from probe-run when it attempts to unwind past Reset, like this:

stack backtrace:
   0: lib::inline::__delay
        at ./asm/inline.rs:62:5
   1: __delay
        at ./asm/lib.rs:51:17
   2: f4::__cortex_m_rt_main
        at src/main.rs:14:9
   3: main
        at src/main.rs:7:1
   4: Reset
(HOST) WARN  call stack was corrupted; unwinding could not be completed

It seems like it might be an option for probe-run to swap to using probe-rs's unwinding support instead of implementing its own unwinder; the unwinder in probe-rs didn't exist when probe-run was started but now seems quite usable, though I've not investigated how easy swapping to it would be. It might also be that the warning can just be turned off as it seems like it's easy to hit in a number of situations besides this one (such as if the backtrace is from a leaf function where LR is used as a general purpose register). In any event it doesn't seem like this change has any impact beyond generating the warning.

Urhengulas commented 1 year ago

Thank you for notifying us 💚

The check if we run into the sentinel frame happens here:

https://github.com/knurling-rs/probe-run/blob/6f3a3a4de428180716690b823985ecdb2f626ec0/src/backtrace/unwind.rs#L90-L92

If I understand correctly, we need to replace this with a new "we hit the stack top symbol"-check.

jamesmunns commented 1 year ago

That's my understanding! At the bottom of that function you advance the stack pointer. After advancing, if sp == _stack_start, you can also gracefully exit. If sp > _stack_start, you probably want to bail with a corrupted stack error.

I'm not sure how you would get the stack_start symbol - but it will definitely exist in the debuginfo, as it is a symbol PROVIDEd by the linker (here: https://github.com/rust-embedded/cortex-m/blob/8e4b18741d77a57eb0ae4d9a011611d1428a478d/cortex-m-rt/link.x.in#L63).

Urhengulas commented 1 year ago

In https://github.com/rust-embedded/cortex-m/pull/463#issuecomment-1428876508 @adamgreig said:

gdb is not affected by default (it detects the main function and stops unwinding past that [...])

Maybe we can do sth similar as well? We already have the start of the main fn, we would just need the size and then would know if the last address of unwinding is in the main fn.

https://github.com/knurling-rs/probe-run/blob/6f3a3a4de428180716690b823985ecdb2f626ec0/src/elf.rs#L49-L51

Urhengulas commented 1 year ago

@jamesmunns said:

That's my understanding! At the bottom of that function you advance the stack pointer. After advancing, if sp == _stack_start, you can also gracefully exit. If sp > _stack_start, you probably want to bail with a corrupted stack error.

I'm not sure how you would get the stack_start symbol - but it will definitely exist in the debuginfo, as it is a symbol PROVIDEd by the linker (here: https://github.com/rust-embedded/cortex-m/blob/8e4b18741d77a57eb0ae4d9a011611d1428a478d/cortex-m-rt/link.x.in#L63).

The _stack_start could be easily extracted while extracting the other symbols from the elf. In the example I just tried it looked like this:

[src/elf.rs:193] symbol = Symbol {
    name: "_stack_start",
    address: 537131968,
    size: 0,
    kind: Label,
    section: Absolute,
    scope: Dynamic,
    weak: false,
    flags: Elf {
        st_info: 16,
        st_other: 0,
    },
}

The symbols are extracted here: https://github.com/knurling-rs/probe-run/blob/6f3a3a4de428180716690b823985ecdb2f626ec0/src/elf.rs#L173-L194

knurling-rs / probe-run

Consider removing the need for a "sentinel stack frame" while unwinding #382