oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

Cortex-M0+ port #395

Closed by cbiffle 2 years ago

cbiffle commented 2 years ago

It would be nice to have Cortex-M0+ support for situations in the future where we need a low-cost micro.

@jperkin has a prototype port.

We should talk. :-)

cbiffle commented 2 years ago

@jperkin thanks for doing this! It's neat.

My initial unsolicited thoughts:

The port looks very plausible. I would like to split the syscall stubs out of userlib/src/lib.rs now that there are two architectures -- perhaps into src/armv6m.rs and src/armv7m.rs, pulled in based on build settings. (This and the RISC-V port also make me want to generate the syscall stubs automatically, because there's a lot of duplication going on -- but that's probably a separate issue.)
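As a rough illustration of the "pull them in based on build settings" idea, here is a minimal sketch of selecting one stub module per architecture with `cfg`. The cfg names `armv6m` and `armv7m` are assumptions for this example (they are not built-in rustc cfgs and would have to be emitted by a build script, for instance with `cargo:rustc-cfg=armv6m`). The inline modules stand in for the proposed `src/armv6m.rs` and `src/armv7m.rs` files, and `stub_arch` is a hypothetical helper, not part of userlib.

```rust
// Sketch only: choose exactly one architecture module at compile time.
// `armv6m` / `armv7m` are assumed custom cfgs set by a build script.

#[cfg(armv6m)]
mod arch {
    // A real split would keep the ARMv6-M syscall stubs in src/armv6m.rs.
    pub const NAME: &str = "armv6m";
}

#[cfg(armv7m)]
mod arch {
    // A real split would keep the ARMv7-M syscall stubs in src/armv7m.rs.
    pub const NAME: &str = "armv7m";
}

// Fallback so the sketch also compiles on a development host,
// where neither cfg is set.
#[cfg(not(any(armv6m, armv7m)))]
mod arch {
    pub const NAME: &str = "host";
}

/// Hypothetical helper: reports which stub module was compiled in.
pub fn stub_arch() -> &'static str {
    arch::NAME
}
```

Because every variant exposes the same names under `arch`, the rest of the crate never has to mention a specific architecture. Recent compilers may warn about unknown cfg names unless the build script also declares them.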

The branch currently doesn't build. The commit it references in the stm32-nightlies repo has disappeared. The stm32 folks (for reasons I can't explain) don't guarantee that commits on master of that repo today will still exist tomorrow -- this is part of why we had to fork it. However, the stm32g070 demo at least does build with the published 0.14 version of stm32g0, so I was able to inspect the results.

I expected that armv6m code would be larger, in general, than armv7m code because of the very limited instruction set, particularly addressing modes. I started by looking at the largest single hunk of code, which is the kernel. Comparison is proving involved, though, because of configuration differences -- since the stm32g0 parts don't implement ITM, and you're leaving them in their default HSI16 clock configuration, the code that actually gets compiled into the kernel is rather different for reasons that have nothing to do with the instruction set. (Also, so that I don't have to do this math again: the stm32h7's vector table is 476 bytes larger than the stm32g0's.)

A more apples-to-apples comparison is the user-leds task.

   text    data     bss     dec     hex filename
   6068       0       0    6068    17b4 target/demo-stm32g070-nucleo/dist/user_leds
   5664       0       0    5664    1620 target/demo-stm32h743-nucleo/dist/user_leds

That's a 7% difference, and the g070-nucleo's user-leds implementation is simpler than the h743-nucleo's because their LED counts differ. Based on past experience and making numbers up, I'd expect a ~10% increase in code size, which means our nearly-32-kiB kernel is going to start becoming a problem on the smaller m0s.
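For anyone who wants to redo the arithmetic, the sketch below recomputes the percentage from the two `text` sizes in the listing and projects what a 10% increase would do to a 32 KiB kernel. The function names are made up for this example. The 32 KiB figure is an upper bound (the kernel is described as "nearly" 32 KiB), and the 10% is the rough estimate from the comment, not a measurement.

```rust
/// Percentage by which `new` exceeds `base`.
pub fn percent_increase(base: u64, new: u64) -> f64 {
    (new as f64 - base as f64) / base as f64 * 100.0
}

/// Size after growing `bytes` by a whole-number percentage.
/// Integer math, so the result rounds down.
pub fn grow_by_percent(bytes: u64, percent: u64) -> u64 {
    bytes * (100 + percent) / 100
}

/// `text` size of user_leds on the stm32h743 demo (ARMv7-M), from the listing.
pub const H743_USER_LEDS: u64 = 5664;
/// `text` size of user_leds on the stm32g070 demo (ARMv6-M), from the listing.
pub const G070_USER_LEDS: u64 = 6068;
```

`percent_increase(H743_USER_LEDS, G070_USER_LEDS)` comes out at roughly 7.1, and `grow_by_percent(32 * 1024, 10)` gives 36044 bytes, or about 35.2 KiB.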

The syscall calling convention was initially designed for RV32I, then stuffed awkwardly into ARMv7-M; its use of high-numbered registers makes it rather awkward on ARMv6-M, where these registers are second-class citizens. We might be able to improve the syscall stub performance a bit, but your implementations look like a totally reasonable starting point.

By shrinking some of the flash allocations in the app.toml I've got the g070 demo image juuuust under 128kiB of Flash. It's currently using 32kiB of RAM, most of which is going to Hiffy, and I suspect that could be shrunk but since I don't have hardware to test on, I couldn't validate the result. (Not entirely sure what Hiffy is doing with all that RAM.) I notice that tasks are right around 6kiB on average, which suggests that there's several kibibytes of runtime code getting pulled in to everything -- when this happened on ARMv7M it turned out that I'd accidentally done something silly in the panic handler that caused a lot of unused formatting code to appear used to the linker. At some point I'm interested in taking a pass over the armv6m binaries and looking for similar opportunities, but that doesn't need to block merging.

cbiffle commented 2 years ago

Hm. I am frustrated to note that hprintln and friends on armv6m are generating cpsid/cpsie in unprivileged code. It just so happens that this will work, because (1) these instructions are no-ops rather than traps in unprivileged code and (2) Hubris unprivileged code is single-threaded. Perhaps semihosting is also generating these instructions on armv7m -- I haven't used semihosting recently. In any case, this isn't a problem specific to this port, but is an issue with how the svd2rust-derived ecosystem treats critical sections and supports (or doesn't support) unprivileged mode.

I'm also seeing cpsid in the kernel, but it seems to be entirely semihosting-related there too. (In general we try not to disable interrupts, ever; semihosting is an odd exception because, as a breakpoint-based facility, it inherently stops the whole chip. So this is probably fine.)

cbiffle commented 2 years ago

> I notice that tasks are right around 6kiB on average, which suggests that there's several kibibytes of runtime code getting pulled in to everything -- when this happened on ARMv7M it turned out that I'd accidentally done something silly in the panic handler that caused a lot of unused formatting code to appear used to the linker.

It's Debug formatting, FWIW. The linker's insistence that much Debug formatting is used likely means I need to revisit the panic handler, again.

I've gotten the demo to fit into 64kiB Flash. It required some hack-and-slash.

It is now sitting pretty in 32kiB RAM and 51,200 bytes of Flash. However, I don't have a 64kiB STM32G0 board here to test on, and fitting it into my 32kiB STM32G030 board seems like it'll take some doing, so, I'm going to see about porting to my STM32L053 board to test. That's going to require some aggressive reduction in RAM usage.

I have some further ideas for reducing resource requirements on this class of microcontroller that I'll try later on.

cbiffle commented 2 years ago

Got the port working on the STM32L053.

Used:
  flash: 0x6c00 (42%)
  ram:   0x1800 (75%)

Could probably get that RAM number down with some care; it's almost entirely stacks, and they're likely bigger than they need to be.
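The percentages in the usage report can be checked with a few lines of arithmetic. This sketch assumes the STM32L053 part in question has 64 KiB of Flash and 8 KiB of RAM; those totals are not stated in the thread and are inferred here from the reported percentages. The names are hypothetical.

```rust
/// Assumed Flash size of the board's STM32L053 (not stated in the thread).
pub const FLASH_TOTAL: u64 = 64 * 1024;
/// Assumed RAM size of the board's STM32L053 (not stated in the thread).
pub const RAM_TOTAL: u64 = 8 * 1024;

/// Usage as a whole-number percentage, rounded to nearest.
pub fn usage_percent(used: u64, total: u64) -> u64 {
    (used * 100 + total / 2) / total
}

/// Bytes still free in a region.
pub fn bytes_free(used: u64, total: u64) -> u64 {
    total - used
}
```

With the reported figures, `usage_percent(0x6c00, FLASH_TOTAL)` is 42 and `usage_percent(0x1800, RAM_TOTAL)` is 75, matching the output above. Under the same assumption, `bytes_free(0x1800, RAM_TOTAL)` is 2048, so only 2 KiB of RAM is left before any stack trimming.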

cbiffle commented 2 years ago

The port has merged in #401. Final results are better than what I posted above; I've got Hiffy working on the G031, and the baseline demo without Hiffy is down to about 20 kiB Flash, 3.5 kiB RAM.

Thanks @jperkin for kicking this off.