asm::delay blocks for 1.5 times longer on 0.6.5

japaric commented 3 years ago

interrupt::disable();
let mut p = defmt::unwrap!(cortex_m::Peripherals::take());
p.DCB.enable_trace();
p.DWT.enable_cycle_counter();
let start = DWT::get_cycle_count();
cortex_m::asm::delay(1_000_000);
let end = DWT::get_cycle_count();

defmt::info!("{:?}", end - start);

This snippet prints 1_500_000 with cortex-m 0.6.5 but 1_000_000 with cortex-m 0.6.4

Behavior observed on both the nRF52840 and the STM32L433.

I personally consider this to be a breaking change. The API doc does say that function blocks for "at least" N clock cycles but I would the function at worst to be off by some constant/arithmetic factor (+k) not by a geometric factor (*k).

And I would only expect that arithmetic factor in presence of interrupts.

I'm aware that delay can be off by a geometric factor is there's Flash latency but that's not what observed above: Flash configuration was not changed; only the cortex-m version was.

adamgreig commented 3 years ago

This is #312; previously the implementation was broken on Cortex-M7 cores (#236) where it would complete in fewer than the requested number of cycles, whereas now it always takes at least the requested number of cycles (or more, in these cases).

It's definitely not a breaking API change, but it's come in on cortex-m 0.6.5 (and 0.6.6 and 0.6.7) because it's using 0.7.1's delay implementation, which I guess could have been more clearly indicated in the changelog. A non-breaking version bump changing what could have been "calibrated" delays could definitely cause users problems. That said, the docs for this method are extremely clear that it may delay indefinitely longer than requested and shouldn't be used if accurate timing is required, and don't make any promises about constant vs geometric factors.

If there was an easy way to specialise it for CPU architecture so that it remains correct on all cortex-m cores that would be great, but as it is I think the current behaviour (correct on all CPUs, but slower than before) is better than the old behaviour (may undercount on some CPUs, but is more accurate on others), at least for 0.7. We could consider releasing 0.6.8 with the old delay behaviour and keeping the bug fix only for 0.7, but it would take some messing with the asm, which in 0.6.5+ is entirely re-exported from the new system in 0.7; we'd need to bring back the old .s files, pre-built bins, and optional inline asm, and then rename the relevant symbols to prevent duplicate symbol errors. We could also call this out more specifically in the CHANGELOG and update the documentation for 0.7 to be more clear that it is likely to delay for a constant factor longer on cortex-m0/3/4 cores.

skibon02 commented 1 year ago

Actually faced the same problem. I checked the assembler code in both cases with and without inline-asm feature. Initialization may be different, but the interesting part is the same:

   0x0800040c <+160>:   subs    r1, #1
   0x0800040e <+162>:   bne.n   0x800040c <seg1::__cortex_m_rt_main+160>

Here we have subs function, that decrements r1 value and branch that repeats this action untill r1 become zero. Totally we have 2 * r1 instructions + c (2 or 3 initialization instructions that also divide input parameter by 2 and save it to r1 register).

But totally we have 3 cycles on these two instructions and here's why: subs always takes 1 CPU cycle bne.n can take 1 or 2, but if branch is taken it is at least 2 cycles:

Cortex-M4 instructions reference P The number of cycles required for a pipeline refill. This ranges from 1 to 3 depending on the alignment and width of the target instruction, and whether the processor manages to speculate the address early.

Conditional branch: 1 or 1 + P cycles

Conditional branch completes in a single cycle if the branch is not taken

According to this info it can be 3 to 5 cycles total, so we should better divide input by 3 instead of 2. On my stm32f401 and stm32f446 microcontrollers it takes 3 cycles always.

I am not sure about other architectures than cortex-m4, but probably they should have the same behavior.

adamgreig commented 1 year ago

The nuisance is that it does change with architecture, in particular Cortex-M7 has a dual-issue pipeline which means two instructions may be retired per clock cycle in some cases.

As an example, consider this code:

#[cortex_m_rt::entry]
fn main() -> ! {
    rtt_init_print!();
    let mut cp = cortex_m::Peripherals::take().unwrap();
    cp.DCB.enable_trace();
    cp.SCB.enable_icache();
    cortex_m::peripheral::DWT::unlock();
    cp.DWT.enable_cycle_counter();
    rprintln!("delay(x)     | cpu cycles");
    rprintln!("-------------|-----------");
    for i in 1..10 {
        let delay = i * 1000;
        let t0 = cortex_m::peripheral::DWT::cycle_count();
        cortex_m::asm::delay(delay);
        let t1 = cortex_m::peripheral::DWT::cycle_count();
        rprintln!("{:<13}| {}", delay, t1 - t0);
    }
    loop {}
}

I built it in release mode for thumbv7em-none-eabihf and ran the exact same binary on both an STM32F401 (Cortex-M4) and STM32F767 (Cortex-M7), and got:

STM32F401:

delay(x)     | cpu cycles
-------------|-----------
1000         | 1506
2000         | 3005
3000         | 4504
4000         | 6004
5000         | 7504
6000         | 9005
7000         | 10505
8000         | 12004
9000         | 13505

STM32F767:

delay(x)     | cpu cycles
-------------|-----------
1000         | 519
2000         | 1018
3000         | 1518
4000         | 2018
5000         | 2520
6000         | 3019
7000         | 7019
8000         | 4018
9000         | 4518

The actual delay loop in the output binary is the same as you posted, a subs r1, #0x1 followed by bne back to the first instruction - so it's a two instruction loop, repeated x/2 times. The first case has 500 instances of these two instructions - on the stm32f4 that takes 3 clock cycles per repeat so 1500 cycles total, which matches your observation, but on the stm32f7 it can apparently issue both subs and bne in one cycle, so only takes 500 cycles total!

Actually, this kind of sucks - the whole reason we use x/2 is to try and be correct for dual-issue cores even though we'll over-delay on the simpler cores, but in fact it turns out this isn't even enough and we should just have x loop iterations, which will cause 3x cycle delay on the simpler cores. Maybe we could detect Cortex-M7 cores and change the iteration count to match, but that would add some overhead to shorter delays (though perhaps not in a way that matters).

skibon02 commented 1 year ago

@adamgreig I see. So if there is such a wide range of possible cycles per instruction, maybe there is a way to check it in compile time? Does the target give us enough information to exactly know how much cycles does it take for subs and bne instructions?

adamgreig commented 1 year ago

Those two examples were the exact same binary on the same target - the CPU isn't something we know at compile time.

Possibly we should just document that it will retire x/2 subs and bne instructions plus some small constant overhead, which is usually 1.5x cycles on cortex-m0/3/4 and 0.5x cycles on cortex-m7... but that's not especially satisfactory either.

rust-embedded / cortex-m

asm::delay blocks for 1.5 times longer on 0.6.5 #325