Strange on-target-test results on rust beta

jannic commented 3 months ago

I tried this several times because I couldn't believe it, and it seems to be reproducible:

The command CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +beta test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040 just hangs on the second test (i2c_loopback_async).

On stable, it works reliably: CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +stable test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040.

If I only run i2c_loopback_async without running i2c_loopback first, it works on both stable and beta.

The hang vanishes if I add 4 cortex_m::asm::nop() calls in non_blocking::reset after the two free calls. 3 are not enough. A cortex_m::asm::delay(1000) in the same place neither so it's not purely a timing issue. More likely depends on the code size somehow?

Changing the i2c_loopback test (ie. the one I have to call first to make the second one fail) to only contain the embedded_hal test causes the call to i2c_loopback_async to return with:

ERROR panicked at on-target-tests/tests/i2c_tests/non_blocking.rs:36:93:
`async fn` resumed after completion
└─ panic_probe::print_defmt::print @ /home/jan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/panic-probe-0.3.2/src/lib.rs:104
<lvl> (1/3) running `embedded_hal`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:56
<lvl> (2/3) running `transactions_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:62
<lvl> (3/3) running `i2c_write_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:68
<lvl> all tests passed!
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:43

Tested on a raspberry pico with the documented test connections, pico-probe, probe-rs 0.24.0 on linux. Rust versions are 1.80.0 and 1.81.0-beta.2.

jamesmunns commented 3 months ago

Noting my comments from chat:

Flaky behavior like this is generally from one of three sources:

Potential UB - could be from nostd_async, or within rp-hal: https://github.com/sammhicks/nostd_async/issues/4
Improper sync primitive usage, such as "check then wait", instead of "wait then check", I pointed to https://github.com/rp-rs/rp-hal/blob/956668edc862a2b20fe97841095a2e7beac8f232/rp2040-hal/src/i2c/controller/non_blocking.rs#L179-L191, or https://github.com/rp-rs/rp-hal/blob/885de2ea0730d02d03a12765d3ac3ef212aceb68/rp2040-hal/src/async_utils.rs#L121-L128 as examples of this where it could be possible to race with the interrupt
(distant third) rustc/LLVM miscompilations not unheard of but very rare (only once I can think of in the history of embedded rust being stable?)

ithinuel commented 3 months ago

The unsoundnesses raised in https://github.com/sammhicks/nostd_async/issues/4 while true, are not exercised in these examples.

They all stem from the fact that if tasks are stack allocated, they may be moved which would break the internally kept pointer to them.

In this tests/examples, the tests run and block that main thread until the task has completed. By then, there’s no longer any pointer to the tasks kept anywhere (neither in nostd_async, nor in any Waker as there are no IRQ expected).

@jamesmunns I don’t quite understand your point 2, so for this answer I’ll assume you meant

in async_utils a race is possible if the interrupt fires after line 121 returned Pending but before the waker is even registered.

But the interrupts are expected to be masked (within the peripheral, but enabled in the NVIC), therefore the IRQ isn’t fired until (setup_flags)(periph) is called on line 124 after the waker has been registered.

There may be a need for more detailed documentation to explain the expected behaviour & responsibility, eg:

There must be NO IRQ source remaining unmasked when no IRQ is expected. As when one becomes expected, the IRQ handler may fire even before a waker is registered for it.
The IRQ handler must clear the flag that caused it (to avoid IRQ locking)
An IRQ may only be unmasked in the peripheral AFTER a waker has been registered for it

jamesmunns commented 3 months ago

@ithinuel I was more pointing to this part from dirbaio's comments:

cargo miri test and cargo miri run --example ... fail. I haven't looked deep in the code to know whether this is due to the above issues, or to other issues.

It's possible there is additional latent UB. Also UB doesn't necessarily need to be exercised to cause miscompilations.

You were correct at my second point, I'd definitely need to audit in more detail to come up with a specific indication!

jannic commented 3 months ago

Lacking any better idea, I did some bisecting: nightly-2024-06-13 is the first version where this happens, with nightly-2024-06-12 the on-target-tests succeed.

sammhicks commented 2 months ago

Hi, I'm the developer of nostd_async.

There was definitely UB in nostd_async, which should now have been resolved.

Miri complains about version 0.6.1, but not 0.7.0 to 0.7.2.

However, I'm no longer developing it, so you may wish to use a different async runtime, such as embassy.

thejpster commented 2 months ago

Does diffing the objdump turn up anything interesting? I like to use arm-none-eabi-objdump --no-addresses --no-show-raw-insn -dCt ./file.elf.

rp-rs / rp-hal

Strange on-target-test results on rust beta #823