Stack painting with `--measure-stack` is slow

jonas-schievink commented 3 years ago

With --measure-stack, added in https://github.com/knurling-rs/probe-run/pull/254, we paint the whole area the stack could occupy with a bit pattern, and then read it back to determine the program's stack usage. This can write and read hundreds of KBs of RAM, which takes several seconds, so it would be great to speed this up.

One idea for speeding this up was to essentially run memset on the MCU, but probe-rs does not seem to expose an API for this (if this is even possible at all, with the vendor-provided on-device algorithms).

japaric commented 2 years ago

Context

the measurement consists of two steps:

1. before program start, fill the memory region that corresponds to the call stack with a known bit pattern
2. after program end, linearly search that memory region for the first address that does not contain the known bit pattern

note that the search has to start at the "end" of the stack. in the case of the ARM ISA, where the stack grows downwards, that would be the lowest address
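As a rough illustration of the two steps, here is a host-side model that treats the stack region as a plain byte slice. The pattern value `0xAA` and the function names are assumptions made for this sketch, not taken from probe-run.

```rust
/// Bit pattern used to paint the stack (arbitrary value chosen for this sketch).
const PATTERN: u8 = 0xAA;

/// Step 1: fill the whole stack region with the known pattern.
fn paint(stack: &mut [u8]) {
    stack.fill(PATTERN);
}

/// Step 2: scan from the lowest address upwards and return the offset of the
/// first byte that no longer holds the pattern (`None` if nothing was touched).
fn first_touched(stack: &[u8]) -> Option<usize> {
    stack.iter().position(|&b| b != PATTERN)
}

/// Bytes of stack used, assuming a descending stack whose initial stack
/// pointer sits at the top (highest address) of the region.
fn stack_usage(stack: &[u8]) -> usize {
    first_touched(stack).map_or(0, |offset| stack.len() - offset)
}
```

Index 0 of the slice stands for the lowest address, so the first mismatch found by the scan marks the deepest point the stack reached.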

Solution

here's how to make those two steps (hopefully) faster:

first, we should measure how long the two steps take right now. both are currently done using a probe_rs API that does a memcpy between the host and the target over USB.
to make step (1) faster try this:
- load a fill_stack subroutine to the target
- have the target execute that subroutine and pause (breakpoint instruction) when it's done
- the host busy waits until the target is done (hits the breakpoint)
to make step (2) faster try this:
- load a search_stack subroutine to the target
- have the target execute that subroutine, store the result address in a register and pause when it's done
- the host busy waits until the target is done then it reads the target's register that contains the result

these two operations can be prototyped outside probe-run using the probe_rs library.

these two alternative approaches should be timed before being integrated into probe-run. if it turns out they are slower then there's no point in integrating them.
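A minimal timing helper for that comparison could look like the following. `time_it` and `is_faster` are hypothetical names; the closures passed in would wrap the existing host-driven operation and the on-target prototype respectively.

```rust
use std::time::{Duration, Instant};

/// Runs `op` once and returns its result together with the elapsed wall-clock time.
fn time_it<T>(op: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = op();
    (out, start.elapsed())
}

/// The candidate is only worth integrating if it beats the current baseline.
fn is_faster(baseline: Duration, candidate: Duration) -> bool {
    candidate < baseline
}
```

A single run is noisy, so in practice each variant would probably be timed several times and on more than one stack size before drawing a conclusion.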

More context

more details on loading and executing the program on the target:

How to write the subroutine?

the fill_stack function can be written in Rust but must be cross compiled to the thumbv6m-none-eabi target so that it also works with Cortex-M0. after that function is cross compiled it'll become machine code (a bunch of bytes); that's what needs to be loaded to the target. the function should be self-contained and must not call any other function (otherwise executing it becomes tricky). it's also OK to write the function in assembly; actually, it may be easier to avoid stack usage and function calls that way. as we'll only use the machine code, it doesn't matter what the source language is.
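A possible shape for the two subroutines in Rust is sketched below. The signatures, the word-sized pattern and the use of volatile accesses are assumptions of this sketch. Halting the target when the routine finishes (for example with a breakpoint instruction) is left out, and whether the compiled code really avoids the stack and other calls still has to be checked in the disassembly.

```rust
use core::ptr;

/// Writes `pattern` to every word in `[cursor, end)`.
///
/// Safety: the range must be valid, writable and word-aligned, and must not
/// contain this function's own machine code.
pub unsafe extern "C" fn fill_stack(mut cursor: *mut u32, end: *mut u32, pattern: u32) {
    while cursor < end {
        unsafe {
            // Volatile so the compiler does not turn the loop into a `memset` call.
            ptr::write_volatile(cursor, pattern);
            cursor = cursor.add(1);
        }
    }
}

/// Returns the address of the first word in `[cursor, end)` that differs from
/// `pattern`, or `end` if every word still holds it.
///
/// Safety: the range must be valid, readable and word-aligned.
pub unsafe extern "C" fn search_stack(
    mut cursor: *const u32,
    end: *const u32,
    pattern: u32,
) -> *const u32 {
    while cursor < end {
        if unsafe { ptr::read_volatile(cursor) } != pattern {
            break;
        }
        cursor = unsafe { cursor.add(1) };
    }
    cursor
}
```

Both functions only use `core`, so the same source can be built for the host to test the logic and for thumbv6m-none-eabi to obtain the machine code.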

Where to load the subroutine?

after that, the question is where to load the subroutine: I would suggest loading it to RAM because that's easier than writing to Flash and that way there's no risk it'll collide with the program we want to run on the target. careful here: the subroutine will write to RAM, so it must be loaded somewhere it won't overwrite itself
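The "won't overwrite itself" condition can be expressed as a range check on the host before loading anything. This is a sketch with hypothetical names, where `ram` is the target's RAM region and `painted` is the region the subroutine will fill:

```rust
use std::ops::Range;

/// True if the two half-open address ranges share at least one byte.
fn overlaps(a: &Range<u32>, b: &Range<u32>) -> bool {
    a.start < b.end && b.start < a.end
}

/// Checks that a subroutine of `len` bytes loaded at `addr` lies entirely in
/// RAM and outside the region that is going to be painted.
fn is_safe_load_address(ram: &Range<u32>, painted: &Range<u32>, addr: u32, len: u32) -> bool {
    let Some(end) = addr.checked_add(len) else {
        return false; // address range wraps around the 32-bit address space
    };
    let code = addr..end;
    let inside_ram = ram.start <= code.start && code.end <= ram.end;
    inside_ram && !overlaps(&code, painted)
}
```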

How to run the subroutine?

to run the subroutine it should suffice to set the program counter (PC) register to the start of it and resume the target. that is only the case if the subroutine does not use any stack space; this should hold for these simple functions, but double check the assembly (the Stack Pointer register should NOT be modified)
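The host side of that sequence might be structured as below. The `Target` trait is a hypothetical stand-in for whatever the debug library provides; it is NOT the probe_rs API. Reading the result from `r0` is also an assumption, and `max_polls` is added so the busy wait cannot hang forever.

```rust
/// Hypothetical minimal view of a debug session (not the probe_rs API).
trait Target {
    fn write_memory(&mut self, addr: u32, bytes: &[u8]);
    fn set_pc(&mut self, addr: u32);
    fn resume(&mut self);
    fn is_halted(&mut self) -> bool;
    fn read_r0(&mut self) -> u32;
}

/// Loads `code` at `load_addr`, runs it, and busy-waits until the target halts.
/// Returns the value of `r0` after the halt, or `None` if the target did not
/// halt within `max_polls` polls.
fn run_subroutine(
    target: &mut impl Target,
    load_addr: u32,
    code: &[u8],
    max_polls: u32,
) -> Option<u32> {
    target.write_memory(load_addr, code);
    target.set_pc(load_addr);
    target.resume();
    for _ in 0..max_polls {
        if target.is_halted() {
            return Some(target.read_r0());
        }
    }
    None
}
```

Keeping the sequence behind a small trait also makes it possible to exercise the logic with a mock target before trying it on real hardware.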

Urhengulas commented 2 years ago

Reopening because only part of it is fixed so far.
