knurling-rs / probe-run

Run embedded programs just like native ones
Apache License 2.0

Stack painting with `--measure-stack` is slow #258

Closed jonas-schievink closed 2 years ago

jonas-schievink commented 3 years ago

With --measure-stack, added in https://github.com/knurling-rs/probe-run/pull/254, we paint the whole area the stack could occupy with a bit pattern, and then read it back to determine the program's stack usage. This can write and read hundreds of KBs of RAM, which takes several seconds, so it would be great to speed this up.

One idea for speeding this up was to essentially run memset on the MCU, but probe-rs does not seem to expose an API for this (if this is even possible at all, with the vendor-provided on-device algorithms).

japaric commented 2 years ago

Context

the measurement consists of two steps:

  1. before program start, fill the memory region that corresponds to the call stack with a known bit pattern
  2. after program end, linearly search that memory region for the address that does not contain the known bit pattern

note that the search has to start at the "end" of the stack. in the case of the ARM ISA, where the stack grows towards lower addresses, that would be the lowest address
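to make the two steps concrete, here's a minimal host-side sketch that models the stack region as a byte slice (index 0 is the lowest address). the `PATTERN` value and the function names are assumptions made for illustration; they are not taken from probe-run.

```rust
/// Byte used to paint the stack region (an arbitrary choice for this sketch).
const PATTERN: u8 = 0xAA;

/// Step 1: before program start, paint the whole stack region.
fn paint(stack: &mut [u8]) {
    stack.fill(PATTERN);
}

/// Step 2: after program end, scan upward from the lowest address.
/// The first byte that no longer holds the pattern marks the deepest
/// point the stack reached; everything above it counts as used.
fn measure_usage(stack: &[u8]) -> usize {
    let untouched = stack.iter().take_while(|&&b| b == PATTERN).count();
    stack.len() - untouched
}
```

one limitation of this approach in general: if the program happens to write bytes equal to the pattern at its deepest stack position, the result is an under-estimate.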

Solution

here's how to make those two steps (hopefully) faster:

these two operations can be prototyped outside probe-run using the probe_rs library.

these two alternative approaches should be timed before being integrated into probe-run. if it turns out they are slower then there's no point in integrating them.

More context

more details on loading and executing the program on the target:

How to write the subroutine?

the fill_stack function can be written in Rust but must be cross compiled to the thumbv6m-none-eabi target so that it also works with Cortex-M0. after that function is cross compiled it'll become machine code (a bunch of bytes); that's what needs to be loaded to the target. the function should be written in a way that's self-contained and does not perform any other function call (otherwise executing it becomes tricky). it's also OK to write the function in assembly; actually, it may be easier to avoid stack usage and function calls that way. as we'll only use the machine code, it doesn't matter what the source code is.
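as a rough sketch of what such a function could look like in Rust: the signature (start pointer, end pointer, pattern) is an assumption, not something the issue specifies. volatile stores are used so that the compiler should neither drop the writes nor replace the loop with a call to `memset`, which would break the "no other function call" requirement. whether the compiled code really leaves the stack alone still has to be checked in the disassembly.

```rust
use core::ptr;

/// Fills the word range `[start, end)` with `pattern`.
/// (Hypothetical signature, chosen for this sketch.)
///
/// # Safety
/// `start` and `end` must be 4-byte aligned, `start <= end`, and the
/// whole range must be writable memory.
#[inline(never)]
pub unsafe extern "C" fn fill_stack(mut start: *mut u32, end: *mut u32, pattern: u32) {
    while start < end {
        unsafe {
            // Volatile store: one plain 32-bit write per iteration.
            ptr::write_volatile(start, pattern);
            start = start.add(1);
        }
    }
}
```

with `extern "C"` the three arguments follow the platform's C calling convention; on Arm targets that means they arrive in `r0`, `r1` and `r2`.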

Where to load the subroutine?

after that, the question is where to load the subroutine: I would suggest loading it to RAM because that's easier than writing to Flash and that way there's no risk it'll collide with the program we want to run on the target. careful here: the subroutine will write to RAM, so it must be loaded somewhere it won't overwrite itself, that is, outside the stack region it fills.
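the "won't overwrite itself" condition boils down to a range check. a minimal sketch, assuming the RAM region, the stack region and the subroutine's load range are all known as half-open address ranges (the function names and the example addresses are made up for illustration):

```rust
use std::ops::Range;

/// True if the two half-open address ranges share at least one address.
fn overlaps(a: &Range<u32>, b: &Range<u32>) -> bool {
    a.start < b.end && b.start < a.end
}

/// A load range is acceptable if it is non-empty, lies entirely inside
/// RAM, and does not touch the stack region the subroutine will fill.
fn placement_is_safe(ram: &Range<u32>, stack: &Range<u32>, code: &Range<u32>) -> bool {
    let inside_ram = ram.start <= code.start && code.end <= ram.end;
    code.start < code.end && inside_ram && !overlaps(code, stack)
}
```

for example, with RAM at `0x2000_0000..0x2001_0000` and the stack occupying its upper half, a 64-byte subroutine loaded at the very start of RAM passes the check, while one straddling `0x2000_8000` does not.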

How to run the subroutine?

to run the subroutine it should suffice to set the program counter (PC) register to the start of it and resume the target. that would only be the case if the subroutine does not use any stack space; that should be the case for these simple functions but double check the assembly (the Stack Pointer register should NOT be modified).
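the order of operations can be sketched against a stand-in interface. everything here is hypothetical: the `DebugTarget` trait and the `Reg` enum are NOT the probe_rs API, they only name the three capabilities the steps above need. the sketch also assumes the subroutine takes its arguments (start, end, pattern) in `r0`-`r2`, and it leaves out how the subroutine signals completion back to the host.

```rust
/// Registers this sketch touches (hypothetical, not a probe_rs type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Reg {
    R0,
    R1,
    R2,
    Pc,
}

/// Hypothetical stand-in for a debug-probe connection to a halted target.
pub trait DebugTarget {
    fn write_memory(&mut self, address: u32, data: &[u8]);
    fn write_register(&mut self, reg: Reg, value: u32);
    fn resume(&mut self);
}

/// Loads the subroutine's machine code to RAM, passes its arguments in
/// r0-r2, points PC at the start of the code and resumes the target.
/// The Stack Pointer is deliberately left untouched.
pub fn start_fill_stack<T: DebugTarget>(
    target: &mut T,
    load_addr: u32,
    machine_code: &[u8],
    stack_low: u32,
    stack_high: u32,
    pattern: u32,
) {
    target.write_memory(load_addr, machine_code);
    target.write_register(Reg::R0, stack_low);
    target.write_register(Reg::R1, stack_high);
    target.write_register(Reg::R2, pattern);
    target.write_register(Reg::Pc, load_addr);
    target.resume();
}
```

prototyping against a trait like this also makes it easy to time the approach with different back ends before integrating anything into probe-run.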

Urhengulas commented 2 years ago

Reopening because only part of it is fixed so far.