rp-rs / rp-hal

A Rust Embedded-HAL for the rp series microcontrollers
https://crates.io/crates/rp2040-hal
Apache License 2.0

Support for placing functions/statics in ram #576

Open DusterTheFirst opened 1 year ago

DusterTheFirst commented 1 year ago

The RAM of the Pico has a much faster access time than the flash, and for very tight loops this access time can matter. It would be nice to be given a way to place a function in RAM. The current memory.x from this crate just defines the main 256 kB block of RAM, but there are also two 4 kB blocks of RAM at addresses 0x20040000 and 0x20041000 which can be useful for placing single-core-specific high-bandwidth data that can be lock-free (as described in the RP2040 datasheet).

2.6. Memory: SRAM is mapped to system addresses starting at 0x20000000. The first 256 kB address region is word-striped across the four larger banks, which provides significant memory parallelism benefits for most use cases. The next two 4 kB regions (starting at 0x20040000 and 0x20041000) are mapped directly to the smaller, 4 kB memory banks. Software may choose to use these for per-core purposes, e.g. stack and frequently-executed code, guaranteeing that the processors never stall on these accesses.

Being able to choose to use these as one contiguous block, or to place data directly into them, would be a welcome feature of the HAL. At a minimum, it would help to demonstrate how a user could extend the linker script (memory.x) so that their code can use #[link_section = "..."] to place items there manually (as many users are not familiar with linker scripts).
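
For illustration, the kind of usage being asked for might look like this (a sketch only; the ".scratch_x" section name is hypothetical and assumes memory.x defines a matching output section that is copied to RAM at startup):

// Hypothetical ".scratch_x" output section backed by one of the 4 kB banks.
#[link_section = ".scratch_x"]
static LUT: [u16; 256] = [0; 256]; // e.g. a per-core lookup table

#[link_section = ".scratch_x"]
#[inline(never)] // keep the hot code as a real out-of-line function in that section
fn encode(input: &[u8], out: &mut [u16]) {
    for (i, o) in input.iter().zip(out.iter_mut()) {
        *o = LUT[*i as usize];
    }
}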

jannic commented 1 year ago

This is a very interesting topic, and I'd love to have good solutions!

What's the main reason you want to have code in RAM?

In the first case, as you wrote, some #[link_section ...] annotations together with a matching memory.x could work. The performance boost probably depends strongly on access patterns. If the XIP cache has a good hit rate, code in flash shouldn't be much slower than code in RAM. We should try to measure the difference.

For worst-case latency, I see one issue: while the RP2040 guarantees that RAM access is free of wait states (under certain conditions), the Rust compiler doesn't guarantee that all running code is in RAM. Even if a function is annotated with a proper link_section, parts of the code could be moved to different functions by compiler optimization steps. And code might contain hidden function calls to functions in other sections, e.g. due to intrinsics. Code that needs to disable XIP, for example for re-writing flash, has the same issue. The solution is often to write critical code in assembly.
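
To illustrate the hidden-call problem, here is a made-up example (not from this thread): on thumbv6m there is no 64-bit hardware division, so this innocent-looking RAM function still calls a compiler builtin that the linker places in flash:

#[link_section = ".data"] // function body is copied to RAM at startup by cortex-m-rt
#[inline(never)]
fn ticks_to_ms(ticks: u64, ticks_per_ms: u64) -> u64 {
    // This division compiles to a call to __aeabi_uldivmod from compiler_builtins,
    // which ends up in .text in flash, so the "RAM function" can still stall on
    // an XIP cache miss.
    ticks / ticks_per_ms
}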

If you want to be sure that all relevant code is in RAM, and don't want to use assembly, it may be best to put the whole program in RAM, copied from flash at boot time before main. But that would be an entirely different approach, of course with the strict limitation that the code needs to fit in the available RAM.

In short, I don't think there is one solution to your request, but there's a spectrum of possibilities.

DusterTheFirst commented 1 year ago

The main reason this has become an interest for me is that I am attempting to re-implement the main functionality of the PicoDVI library. That library really pushes the limits of the Pico, requiring an overclock to >200 MHz.

Honestly, I am not more knowledgeable about the RP2040 than the author of that library, who was on the team that created the chip, so when I see that he did something in his implementation, I tend to just follow suit. He places a lot of the pixel to 8b/10b conversion (which is how DVI transmits its binary data) in RAM, as well as placing the lookup tables in the scratch RAM banks.

You may be correct that the XIP cache may be enough to keep the hot code cached, but since the DVI code works off of interrupts, I was unsure whether the XIP cache would do well at caching constant jumps to and from interrupt handlers at the overclock I need. But the implementation may just work entirely from flash; I have not gotten far enough to tell. I will use the XIP cache hit/miss performance counters once I get my implementation working a bit more, so I can see whether the XIP cache is enough for my purposes.

jannic commented 1 year ago

This is actually a very good use case for code in RAM: it's a real-time situation where it's not enough to have good average performance. If the CPU gets delayed by an XIP cache miss, it will provide the video data too late and there will be glitches in the image. It's not even about tight loops, as those would be in the XIP cache after the first iteration; it's the latency of the first access that is already too much.

And in this case, the missing guarantees from the compiler don't matter much: In case some future compiler version does a new optimization which happens to move some of the code to flash, it wouldn't cause any serious damage, but only a bad DVI signal.

The Neotron Pico is in the same situation, just with a VGA output instead of DVI. It just places a few functions in the data section so I think it doesn't even need a special entry in the linker file: https://github.com/Neotron-Compute/Neotron-Pico-BIOS/blob/develop/src/vga/mod.rs#L1278

@thejpster, I didn't follow the latest developments. What's your experience with placing those functions in RAM, does it actually help in practice?

thejpster commented 1 year ago

Yes, it helps my use case greatly. Before, we had sparkling video noise where some lines randomly didn't render in time, and we could only do two global colours. Now it's rock solid, even with two colours per character.

The effect can be measured by putting an oscilloscope on the LED (Pin 13 on the Pico) as that records the time taken in the render routine.
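
As a sketch of that measurement technique (not taken from the Neotron code; the function and pin names are made up, and it assumes a pin already configured as a push-pull output with rp2040-hal and the embedded-hal 0.2 OutputPin trait):

use embedded_hal::digital::v2::OutputPin;

// Bracket the time-critical routine with a pin toggle; the width of the high
// pulse on the scope is the routine's execution time.
fn render_line(debug_pin: &mut impl OutputPin) {
    let _ = debug_pin.set_high(); // rising edge: routine starts
    // ... time-critical rendering work goes here ...
    let _ = debug_pin.set_low();  // falling edge: routine done
}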

ithinuel commented 1 year ago

I guess we could add areas for the non-striped RAM to the examples' memory.x.

Would you expect one or two areas?

DusterTheFirst commented 1 year ago

Right now, the only memory that the Rust code can even use is the main 256 KiB. It would be nice to be able to place things in the two 4 KiB regions. So at a minimum, it would be good to have those two regions available for content placed manually in RAM.

jannic commented 1 year ago

I think those should be separate areas, as one important use case would be to define dedicated RAM areas for each core to have uncontested access.

But what should those regions be called?

We could use the naming from the C SDK's linker file: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/pico_standard_link/memmap_default.ld#L26-L29 I don't know where the names SCRATCH_X and SCRATCH_Y came from, but why invent another name if there is an existing convention?

However, there are other possible choices, for example (S)RAM4/5: the SRAM is organized in 6 banks (0-5) which can be individually powered, have separate bus performance counters (datasheet section 2.1.1.2) etc., and the registers related to those functions call these regions SRAM4 and SRAM5. Parts of the C SDK use that naming as well: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2040/hardware_regs/include/hardware/regs/addressmap.h#L30-L31 This naming feels a little more substantiated than the arbitrary scratch x/y one, but it may look strange to have regions named SRAM4/5 and not SRAM0-3.
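
For illustration, the SRAM4/SRAM5 variant in memory.x could look something like this (a sketch; the existing BOOT2/FLASH/RAM entries are elided):

MEMORY {
    /* BOOT2, FLASH and RAM as already defined in memory.x ... */
    SRAM4 : ORIGIN = 0x20040000, LENGTH = 4k  /* SCRATCH_X in the C SDK */
    SRAM5 : ORIGIN = 0x20041000, LENGTH = 4k  /* SCRATCH_Y in the C SDK */
}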

ithinuel commented 1 year ago

The other banks are also memory-mapped non-striped, so I guess we could also expose them as such :shrug: But that seems more like a footgun than anything, wrt aliasing rules etc.

I'd be happy with the declaration of SRAM4 and SRAM5, with a little comment above saying that SRAM0-3 are striped and exposed as one block.

jannic commented 1 year ago

So this could be the first step: https://github.com/rp-rs/rp-hal/pull/578

I didn't define linker sections yet, because I don't know how to best initialize those regions.

The C SDK uses a list of memory regions to be initialized by crt0: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/pico_standard_link/crt0.S#L284-L302 https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/pico_standard_link/crt0.S#L232-L241

Our "crt0" comes from cortex-m-rt. That one only copies one block: https://github.com/rust-embedded/cortex-m/blob/master/cortex-m-rt/src/lib.rs#L564-L574

So for full support of those regions, we'd need to extend or fork cortex-m-rt.

ithinuel commented 1 year ago

As far as I understand, Rust does not require memory regions to be initialised in a certain way (see the nomicon).

I believe it's fine to just define them in the linker and let the user place things in there with the attribute.

jannic commented 1 year ago

As far as I understand, Rust does not require memory regions to be initialised in a certain way (see the nomicon).

That's true as long as you expect the variable to be uninitialized. But given this program:

static mut RAM: u32 = 1;
#[link_section = "sram4"]
static mut SRAM4: u32 = 2;

#[entry]
fn main() -> ! {
    info!("Program start");

    info!("ram: {}", unsafe { RAM });
    info!("sram4: {}", unsafe { SRAM4 });
[...]

Would you expect this output?

INFO  Program start
└─ rp2040_project_template::__cortex_m_rt_main @ src/main.rs:22
INFO  ram: 1
└─ rp2040_project_template::__cortex_m_rt_main @ src/main.rs:24
INFO  sram4: 4294967295
└─ rp2040_project_template::__cortex_m_rt_main @ src/main.rs:25

(Full example is here: https://github.com/jannic/rp2040-project-template/blob/8faf7e08207c972fdf197efe04ed6725bfc42ed2/src/main.rs#L25)

ithinuel commented 1 year ago

Ah, indeed, but then we'd need slightly more complex pre-main code to set up the other regions. I'm not sure changing the pre-main loader in every code base for this use case is worth it in most cases.

I'd go with no sections (just the memory areas) and maybe a bit of documentation about MaybeUninit or the pre-main loading requirement. I don't know where that should be hosted though.
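
As a sketch of what such documentation could show (the ".sram4" section name is hypothetical; it assumes memory.x only declares the region and nothing initializes it at startup):

use core::mem::MaybeUninit;

// Placed in the scratch bank but never copied from flash, so the program must
// not assume it holds any particular value until it has written to it.
#[link_section = ".sram4"]
static mut SCANLINE_BUF: MaybeUninit<[u32; 512]> = MaybeUninit::uninit();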

DusterTheFirst commented 1 year ago

This could also be behind a feature gate, removing any pre-main code for those not interested in using the 2 banks.

ithinuel commented 1 year ago

I mean, this is currently handled by the cortex-m-rt crate.

I guess it could be done in the entry wrapper where we initialize the spinlocks. Feel free to open a PR that'd do what you need :)

jannic commented 1 year ago

Maybe this needs to be done earlier, from some assembly startup code. Running Rust code before the statics are initialized can easily be unsound. But some proof of concept code would be a great first step.

jannic commented 1 year ago

I think it's possible to implement the initialization with a __pre_init function in global_asm!: https://github.com/jannic/rp2040-project-template/blob/b969a326fd1151fb3ad8db1930beddf282a0cc84/src/main.rs#L23-L58

(And I spent far too much time fighting inexplicable linker errors until I noticed that it was probably flip-link not handling the additional sections well. EDIT: Which is a known limitation: https://github.com/knurling-rs/flip-link/issues/3) Another EDIT: It looks like the final version does work with flip-link. I'm not sure what was different; some intermediate version was broken when flip-link was enabled but worked when I disabled it.
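
For readers who don't follow the link, a rough sketch of that approach (the symbol and section names are hypothetical, and the linked code may differ): cortex-m-rt calls a user-overridable __pre_init very early in Reset, before .data/.bss are initialized, so the copy can be done there in assembly:

use core::arch::global_asm;

// Copy the hypothetical ".scratch_x" output section from its load address in
// flash to SRAM4 before any Rust code runs. The three symbols are assumed to
// be defined by the linker script.
global_asm!(
    ".global __pre_init",
    ".type __pre_init, %function",
    ".thumb_func",
    "__pre_init:",
    "    ldr r0, =__sscratch_x",  // destination start (in SRAM4)
    "    ldr r1, =__escratch_x",  // destination end
    "    ldr r2, =__siscratch_x", // source (load address in flash)
    "1:",
    "    cmp r0, r1",
    "    bhs 2f",
    "    ldmia r2!, {{r3}}",      // copy one word at a time
    "    stmia r0!, {{r3}}",
    "    b 1b",
    "2:",
    "    bx lr",
    ".ltorg",                     // emit the literal pool for the ldr = pseudo-ops
);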

bigbass1997 commented 1 year ago

Hello, I noticed this issue after encountering (what I determined to be) cache misses that caused seemingly random delays in time-sensitive situations. Thought I'd throw my use case into the discussion.

I'll try to keep this short: my project is basically a hardware-emulated NES (Nintendo Entertainment System) controller, but where the input values are predefined, and are streamed over USB. Each bit of input must be asserted within a small window (~5us). If even a single window is missed, the predefined inputs will no longer synchronize with what was expected in-game. (This could probably be handled by a PIO but there are many quirks that make it easier to use GPIO interrupt handlers instead.)

Despite isolating the time-sensitive code into core0, with the USB handling in core1, I was noticing rare instances where simple interrupt handlers in core0 were taking significantly longer than normal. It appeared to only happen when core1 was handling some USB communication, so naturally I thought it was somehow blocking core0. However, even after rewriting the USB handling to only happen during downtime, I still noticed occasional delays.

Turns out, XIP cache misses were the ultimate problem. The USB handling code in core1 was complex/large enough that in some cases, core0 would encounter a cache miss. Unknown to me at first, cache misses can cause relatively huge delays (code that normally takes ~0.5us would take 10x longer, if not more).

(Side note: I measured performance the same way thejpster mentioned, by turning a GPIO pin on/off at the beginning/end of a given function, and then recording that with a scope/logic-analyzer)


My solution was to add a section to memory.x:

.ram_code ORIGIN(RAM) :
{
    KEEP(*(.ram_code));
} > RAM

And then mark the relevant functions with #[link_section = ".ram_code"], which I only knew how to do from an unrelated embedded project (compiling Rust to run on the N64). I don't know if this is the "best" way, but it works for me.

That all said, I'm not sure what the best solution is from a HAL perspective. At the least, perhaps add a similar section to the linker script, and an example that makes use of it. If I had known the XIP cache existed, and the potential slow-downs it could cause, I would have saved myself a lot of debugging. (Even after reading through the RP2040 datasheet, it didn't immediately occur to me that it would cause such significant problems in my situation.)

jannic commented 1 year ago

My solution was to add a section to memory.x:

.ram_code ORIGIN(RAM) :
{
    KEEP(*(.ram_code));
} > RAM

And then mark the relevant functions with #[link_section = ".ram_code"], which I only knew how to do from an unrelated embedded project (compiling Rust to run on the N64). I don't know if this is the "best" way, but it works for me.

That's interesting: If you do it this way, how does the RAM get initialized on startup? As I understood it, one would need to mark this with >RAM AT>FLASH and have some startup code to initialize RAM from flash. (I'm not saying that the way you describe doesn't work - just that I don't know how it works.)

bigbass1997 commented 1 year ago

I'm not 100% certain. I just copied the bootloader section since I assumed something must already load that as well. https://github.com/rp-rs/rp2040-project-template/blob/b408c04af9c2b7326a87418fd5ca20e08e9e2b10/memory.x#L11-L14

Looking at the compiled ELF, both .boot2 and my added .ram_code are specified as loadable segments. I'd expect whatever program parses the ELF to check for loadable segments and place them at the addresses where they are defined to live. The data blocks in a UF2 have an address associated with where the data should be placed in memory, so UF2 files should be capable of loading data into RAM. But the more I look into the different programs, the more it seems they don't all behave the same way.

I've been using probe-rs-cli to upload my program via SWD. I just tried running my project with elf2uf2-rs instead, and it errors with "ELF contains memory contents for uninitialized memory". Also tried probe-run, which claims it flashes successfully, but then errors with "unable to determine reset handler", and doesn't work, even after disconnecting/reconnecting device power. Upon more testing, I am still sure that probe-rs-cli does work, but I've noticed that if I disconnect/reconnect my device from power (USB), it stops working.

So maybe probe-rs-cli is doing something weird that makes this just happen to work? At first I thought maybe it uploads the entire program to RAM, but that doesn't fit, because if I purposely disable the XIP cache, the entire program does drastically slow down. And if that were the case, I wouldn't have run into these cache slowdowns in the first place.


Edit: Just realized my thought process about initialization was quite wrong. When I use probe-rs-cli to program the device, it must load sections marked for RAM into RAM. That explains why it works until I power-cycle the device. The other runners must either not initialize anything marked for RAM, or otherwise detect this scenario and panic.

I guess the proper solution would either require a bootloader that initializes RAM for you (somehow), or include some code in the main project that runs at/before the start of main(). I'm not really sure how to do that.

DusterTheFirst commented 1 year ago

Also tried probe-run, which claims it flashes successfully, but then errors with "unable to determine reset handler"

This may not be an issue with your memory.x, but more related to this issue with the most recent version of probe-run https://github.com/knurling-rs/probe-run/issues/391#issuecomment-1505128718

But your experience also mirrors mine: the RAM sections seemed to be loaded fine, but that must have been because I was using probe-run and uploading the code over SWD, and the SWD uploader is able to write directly to RAM.

bigbass1997 commented 1 year ago

But your experience also mirrors mine: the RAM sections seemed to be loaded fine, but that must have been because I was using probe-run and uploading the code over SWD, and the SWD uploader is able to write directly to RAM.

Yep I think that's what is happening.

Honestly I'm not well-versed with linker scripts or writing bootloader code. I know some bits and pieces. But I think I've come up with something that works. Implemented a minimal example using the RP2040 template: https://github.com/bigbass1997/rp2040-project-template/commit/08e6c56846e11395d95151cea6a4d75e75c7a4bd

I'm fairly sure there is some standard section name that should be used instead of .ram_code, but I'm not certain which is most appropriate. The example prints the location of the relocated function foobar(), from within the function. So if the relocation worked, you should see something like INFO Executing from 0x200...... The LED blinking and the RTT info!'s prove the function works, and the fn pointer address proves it's located in RAM.

And using similar changes in my main project, that also now works even after power-cycling the device.

thejpster commented 1 year ago

I just put my functions in the ".data" section.

jannic commented 1 year ago

I just put my functions in the ".data" section.

That's a good point: there is no additional support needed to place functions/statics in RAM, so the title of this ticket is a little bit misleading. The original issue description (https://github.com/rp-rs/rp-hal/issues/576#issue-1646321873) asks additional questions; that's where no direct support is available yet.

I wonder if we have a good place to document how to place functions in the data section.
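
For reference, the pattern itself is small enough to show inline (a sketch; the function name is made up):

// Because cortex-m-rt already copies .data from flash to RAM at startup,
// no extra linker script or startup code is needed for this placement.
#[link_section = ".data"]
#[inline(never)] // keep the code in this function (and thus in RAM)
fn render_scanline(buf: &mut [u16]) {
    for px in buf.iter_mut() {
        *px = px.wrapping_add(1); // placeholder body
    }
}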

ithinuel commented 1 year ago

I think the difference is that the .data section is in the general (striped) RAM, while placing these RAM functions in the extra space on the RP2040 enables some optimized memory bus path for code execution (though I'm not 100% sure of my understanding).

jannic commented 1 year ago

The "extra space" is not really optimized in any way. It's just two separate RAM banks. But, depending on what the firmware is doing, that might be a significant advantage: If eg. some DMA is putting a lot of load on the striped RAM banks, overall performance will be better if the CPU instructions are read from a separate bank.

Anyway, there are two separate issues:

1) As taken from the title, "Support for placing functions/statics in ram": this is solved by putting functions in the ".data" section; no further support from the HAL is necessary. There's a caveat that rustc doesn't guarantee that the code won't jump from the RAM function to code in flash, but that's an issue that can't be solved from the HAL.

2) Place code in specific RAM banks, like SRAM4/SRAM5. This is currently not possible without manually copying the code from flash to RAM at startup. Of course, this is something where the HAL could help. But I'm not sure there is a good generic solution that would be suitable for the HAL. If any real-world firmware would deviate from what the HAL provides, it would be better to just describe the approach in some documentation.

raphlinus commented 1 year ago

I think this is mostly a question of documentation. There are (at least) three choices for placing functions in RAM. As pointed out above, the easiest is to just place them in the .data segment.

The next choice is to make a new segment (.ram_code) and place it so that it's copied by the Reset code in cortex-m-rt. This is done by placing the following invocation in memory.x:

SECTIONS {
    .ram_code : {
        *(.ram_code .ram_code.*)
        . = ALIGN(4);
    } > RAM AT > FLASH
} INSERT AFTER .data;

This uses a fragile and poorly documented mechanism to cause __edata to point after both the .data and .ram_code segments, so both get copied from flash to RAM.

One advantage of using a separate section is that it's easier to experiment with having this code either execute in place from flash or be copied to RAM, just by changing two lines in memory.x: > RAM AT > FLASH becomes > FLASH, and INSERT AFTER .data becomes INSERT AFTER .rodata. A warning: do not use INSERT AFTER .data unless the section is placed with > RAM AT > FLASH, otherwise the source and destination sizes for the startup copy will get out of sync and it will clobber other important data in RAM.
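
For concreteness, the execute-in-place variant of the same section would look something like this (sketch):

SECTIONS {
    /* Same input sections, but left in flash and executed in place;
       the ".ram_code" name is kept only so the Rust attributes don't change. */
    .ram_code : {
        *(.ram_code .ram_code.*)
        . = ALIGN(4);
    } > FLASH
} INSERT AFTER .rodata;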

Lastly, for DVI you really do want code in scratch RAM (aka SRAM4 and SRAM5), as there is a guarantee there will be no contention in the bus arbiter as long as only one core is accessing it. The magic invocation is to place this in memory.x:

MEMORY {
    BOOT2: ...
    FLASH: ...
    RAM: ...
    SCRATCH_X : ORIGIN = 0x20040000, LENGTH = 4k
}

...

SECTIONS {
    .scratch_x : {
        _scratch_x_start = .;
        *(.scratch_x .scratch_x.*)
        . = ALIGN(4);
        _scratch_x_end = .;
    } > SCRATCH_X AT > FLASH
    _scratch_x_source = LOADADDR(.scratch_x);
} INSERT AFTER .rodata;

However, this is not copied by cortex-m-rt, so you'll have to do it yourself. This can be done in a __pre_init function as suggested above, or just a few lines in main:

extern "C" {
    static _scratch_x_source: u8;
    static mut _scratch_x_start: u8;
    static mut _scratch_x_end: u8;
}

#[entry]
fn main() -> ! {
    unsafe {
        copy_nonoverlapping(
            &_scratch_x_source as *const u8,
            &mut _scratch_x_start as *mut u8,
            &_scratch_x_end as *const u8 as usize - &_scratch_x_start as *const u8 as usize,
        );
    }
    ...
}

It's possible the support code could be improved, as these sections are copied by default in pico-sdk, but we're not blocking on that.

I hope this helps other people grappling with similar problems. Thanks to @jannic for helping me on Matrix.

thejpster commented 1 year ago

I think your last point is OK in practice, but not in theory. Entering main() with any uninitialised statics (that aren't MaybeUninit) is, I believe, Undefined Behaviour.

jannic commented 1 year ago

There is code doing the copying in global_asm!, hooked in via __pre_init, linked from one of my comments above. As far as I know, that should do it without UB. But while still experimenting, the pure Rust version might be more convenient, even if it's formally UB. I don't see any practical reason why the compiler would miscompile it. (Famous last words?)