rp-rs / rp-hal

A Rust Embedded-HAL for the rp series microcontrollers
https://crates.io/crates/rp2040-hal
Apache License 2.0
1.4k stars 227 forks source link

Strange on-target-test results on rust beta #823

Open jannic opened 1 month ago

jannic commented 1 month ago

I tried this several times because I couldn't believe it, and it seems to be reproducible:

The command CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +beta test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040 just hangs on the second test (i2c_loopback_async).

On stable, it works reliably: CARGO_TARGET_THUMBV6M_NONE_EABI_RUNNER="probe-rs run" cargo +stable test -p on-target-tests --no-fail-fast --test i2c_loopback --test i2c_loopback_async -- --chip rp2040.

If I only run i2c_loopback_async without running i2c_loopback first, it works on both stable and beta.

The hang vanishes if I add 4 cortex_m::asm::nop() calls in non_blocking::reset after the two free calls. 3 are not enough. A cortex_m::asm::delay(1000) in the same place neither so it's not purely a timing issue. More likely depends on the code size somehow?

Changing the i2c_loopback test (ie. the one I have to call first to make the second one fail) to only contain the embedded_hal test causes the call to i2c_loopback_async to return with:

ERROR panicked at on-target-tests/tests/i2c_tests/non_blocking.rs:36:93:
`async fn` resumed after completion
└─ panic_probe::print_defmt::print @ /home/jan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/panic-probe-0.3.2/src/lib.rs:104
<lvl> (1/3) running `embedded_hal`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:56
<lvl> (2/3) running `transactions_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:62
<lvl> (3/3) running `i2c_write_iter`...
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:68
<lvl> all tests passed!
└─ i2c_loopback_async::tests::__defmt_test_entry @ tests/i2c_loopback_async.rs:43

Tested on a raspberry pico with the documented test connections, pico-probe, probe-rs 0.24.0 on linux. Rust versions are 1.80.0 and 1.81.0-beta.2.

jamesmunns commented 1 month ago

Noting my comments from chat:

Flaky behavior like this is generally from one of three sources:

  1. Potential UB - could be from nostd_async, or within rp-hal: https://github.com/sammhicks/nostd_async/issues/4
  2. Improper sync primitive usage, such as "check then wait", instead of "wait then check", I pointed to https://github.com/rp-rs/rp-hal/blob/956668edc862a2b20fe97841095a2e7beac8f232/rp2040-hal/src/i2c/controller/non_blocking.rs#L179-L191, or https://github.com/rp-rs/rp-hal/blob/885de2ea0730d02d03a12765d3ac3ef212aceb68/rp2040-hal/src/async_utils.rs#L121-L128 as examples of this where it could be possible to race with the interrupt
  3. (distant third) rustc/LLVM miscompilations not unheard of but very rare (only once I can think of in the history of embedded rust being stable?)
ithinuel commented 1 month ago

The unsoundnesses raised in https://github.com/sammhicks/nostd_async/issues/4 while true, are not exercised in these examples.

They all stem from the fact that if tasks are stack allocated, they may be moved which would break the internally kept pointer to them.

In this tests/examples, the tests run and block that main thread until the task has completed. By then, there’s no longer any pointer to the tasks kept anywhere (neither in nostd_async, nor in any Waker as there are no IRQ expected).

@jamesmunns I don’t quite understand your point 2, so for this answer I’ll assume you meant

in async_utils a race is possible if the interrupt fires after line 121 returned Pending but before the waker is even registered.

But the interrupts are expected to be masked (within the peripheral, but enabled in the NVIC), therefore the IRQ isn’t fired until (setup_flags)(periph) is called on line 124 after the waker has been registered.

There may be a need for more detailed documentation to explain the expected behaviour & responsibility, eg:

jamesmunns commented 1 month ago

@ithinuel I was more pointing to this part from dirbaio's comments:

cargo miri test and cargo miri run --example ... fail. I haven't looked deep in the code to know whether this is due to the above issues, or to other issues.

It's possible there is additional latent UB. Also UB doesn't necessarily need to be exercised to cause miscompilations.

You were correct at my second point, I'd definitely need to audit in more detail to come up with a specific indication!

jannic commented 1 month ago

Lacking any better idea, I did some bisecting: nightly-2024-06-13 is the first version where this happens, with nightly-2024-06-12 the on-target-tests succeed.