raspberrypi / pico-sdk

BSD 3-Clause "New" or "Revised" License
3.61k stars 897 forks source link

Possible race condition in `add_alarm_in_us`/`sleep_us`, incorrect alarm period using LTO #1500

Closed tttapa closed 3 weeks ago

tttapa commented 12 months ago

Below is a simple program that uses a recurring timer interrupt using add_alarm_in_us to blink an LED as a liveness check. In the main loop, it simply reads characters from stdin (USB CDC), uses sleep_us, and echos them to stdout.

When sending enough data to the Pico, it eventually stops blinking, which should never happen.
The Pico is still alive, but the blink timer interrupt somehow stopped firing. After some time (minutes to hours), it suddenly starts blinking again, so it seems like it's just the alarm period that somehow gets corrupted.

The reason that I wrote this small program is that in a much larger program, I sometimes encountered sleep_us hanging in a similar way when the USB traffic was high. I would have a sleep_us(100) somewhere, for example, and I would notice that the Pico would hang on that call for minutes. When attaching a debugger, it turned out that it was indeed waiting inside of sleep_us (all relevant variables were optimized out, so I didn't investigate further, this only happens in release mode).

main.cpp

#include <pico/stdio.h>
#include <pico/stdlib.h>
#include <cstdio>
#include <initializer_list>

// #define SLEEP_US(x) busy_wait_us(x)
#define SLEEP_US(x) sleep_us(x)

static constexpr uint wait_led_pin = 19;
static constexpr uint blink_led_pin = PICO_DEFAULT_LED_PIN;
static constexpr uint test_pin = 2;

int64_t blink(alarm_id_t, void *) {
    static bool led_state = false;
    gpio_put(blink_led_pin, led_state ^= 1);
    return -80'000;
}

void loop() {
    static bool led_state = false;
    gpio_put(wait_led_pin, led_state ^= 1);
    int ch = getchar_timeout_us(100'000);
    if (ch >= 0) {
        gpio_put(test_pin, 1);
        SLEEP_US(static_cast<uint8_t>(ch));
        gpio_put(test_pin, 0);
        putchar_raw(ch);
    }
}

int main() {
    for (uint pin : {wait_led_pin, test_pin, blink_led_pin}) {
        gpio_init(pin);
        gpio_set_dir(pin, GPIO_OUT);
    }

    stdio_init_all();

    auto timer_id = add_alarm_in_us(80'000, blink, nullptr, false);
    if (timer_id < 0)
        panic("No alarm slots available");
    if (timer_id == 0)
        panic("Alarm already passed");

    while (1)
        loop();
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.18)
include(cmake/pico_sdk_import.cmake) # Pull in SDK (must be before project)
project(pico-sleep-test C CXX ASM)
pico_sdk_init() # Initialize the SDK

add_executable(main main.cpp)
target_link_libraries(main PRIVATE pico_stdlib)
pico_enable_stdio_usb(main 1)
pico_enable_stdio_uart(main 0)
pico_add_extra_outputs(main)

Build

export PICO_SDK_PATH="${HOME}/pico/pico-sdk"
cmake -S. -Bbuild \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DPICO_CXX_ENABLE_EXCEPTIONS=Off \
    -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=On
cmake --build build -j -t main

Test

port="ttyACM1"

# Upload the firmware and reset the Pico
~/opt/raspberrypi-openocd/bin/openocd -f interface/cmsis-dap.cfg -c "adapter speed 5000" -f target/rp2040.cfg -c "program build/main.elf verify reset exit"
sleep 3
# Generate some random data to send
[ -e challenge.bin ] || head /dev/urandom -c 1M > challenge.bin
# Configure the serial port
stty -F /dev/$port 115200 raw crtscts -echo
# Discard any old data already in the buffer
timeout 1 cat /dev/$port > /dev/null ||:
# Read the response and write it to a file
cat /dev/$port > response.bin &
# Send the data to the Pico
cat challenge.bin > /dev/$port
# Wait for all data to be received and stop reading
sleep 10
kill -INT %1
# Make sure the response matches the challenge
cmp {challenge,response}.bin

Tools

Ubuntu 22.04.3
pico-sdk 1.5.1
arm-none-eabi-gcc (15:10.3-2021.07-4) 10.3.1 20210621 (release)
libnewlib-arm-none-eabi 3.3.0-1.3

Pico SDK patch
Required to support LTO (otherwise, the linker can't find the __wrap_ functions).

diff --git a/src/rp2_common/pico_platform/include/pico/platform.h b/src/rp2_common/pico_platform/include/pico/platform.h
index 64b5e96..5f905ac 100644
--- a/src/rp2_common/pico_platform/include/pico/platform.h
+++ b/src/rp2_common/pico_platform/include/pico/platform.h
@@ -484,7 +484,7 @@ static __force_inline uint __get_current_exception(void) {
     return exception;
 }

-#define WRAPPER_FUNC(x) __wrap_ ## x
+#define WRAPPER_FUNC(x) __attribute__((used)) __wrap_ ## x
 #define REAL_FUNC(x) __real_ ## x

 /*! \brief Helper method to busy-wait for at least the given number of cycles

I understand that LTO/CMAKE_INTERPROCEDURAL_OPTIMIZATION is not officially supported, however, the fact that it fails with LTO enabled may point to an underlying problem such as a data race or another source of undefined behavior.

brvtalcake commented 11 months ago

I'm really "happy" to see this issue here because I had the same problem 20 minutes ago concerning LTO (wrapper functions not found). So yeah, maybe it could be cool to add an option to the pico sdk to support LTO (something like PICO_SDK_LTO which would add -DPICO_SDK_LTO to the compilation command-line to conditionally define WRAPPER_FUNC to __attribute__((used)) __wrap_ ## x). However, I guess inconsistent behavior when enabling LTO may rather be caused by unwanted optimizations (I don't know exactly how nor why) or wrong code (but I think pico-sdk's code has been sufficiently tested to verify such things) rather than data races (at least in this case, since there's apparently no multicore code nor multithreading possibilities)

kilograham commented 11 months ago

Note #97 for LTO

I understand that LTO/CMAKE_INTERPROCEDURAL_OPTIMIZATION is not officially supported, however, the fact that it fails with LTO enabled may point to an underlying problem such as a data race or another source of undefined behavior.

are you saying it doesn't fail without LTO

tttapa commented 11 months ago

However, I guess inconsistent behavior when enabling LTO may rather be caused by unwanted optimizations (I don't know exactly how nor why) or wrong code (but I think pico-sdk's code has been sufficiently tested to verify such things)

In my experience, what you refer to as "unwanted" optimizations have almost exclusively been caused by invalid code invoking Undefined Behavior.
For correct code without UB, the compiler and the optimizer should not alter the observable behavior of the code (for the standard's definition of observable). As such, "unwanted" optimizations are not really a thing, it's just that if you write code that violates the preconditions of the optimizer, the result won't satisfy its postconditions either. This is of course ignoring the real possibility of compiler bugs, but those are relatively rare, and shouldn't usually be the first hypothesis.

UB is unfortunately not something you can rule out by testing: because of its very nature, problems may only come up in certain scenarios or at sufficiently high optimization levels.

rather than data races (at least in this case, since there's apparently no multicore code nor multithreading possibilities)

True, but in the cases where I encountered the bug, there have always been timer interrupts involved. In the example code I posted, the state of the timer/alarm pool has to be updated both from the main thread and in the interrupt handlers. Combined with the nondeterministic behavior, a data race doesn't seem unlikely, but there could of course be other causes.

are you saying it doesn't fail without LTO

Indeed. Or at least, I haven't observed the same glitch (yet?) without LTO enabled. Given the sporadic nature of the bug, it may just be more likely with LTO enabled, but not impossible without.
Even with LTO enabled, it occasionally takes two or three tries to reproduce the issue.

brvtalcake commented 11 months ago

@tttapa yeah indeed, maybe you're right, especially if it hasn't been tested with LTO enabled before

dpgeorge commented 9 months ago

It could be that enabling/disabling LTO is simply changing the timing and ordering of events, which makes the bug apparent (or not).

We are also seeing a very similar issue in MicroPython and I think we have tracked down the bug to the way the hardware timer target is being configured, sometimes an interrupt is missed and that puts things into an inconsistent state. See #1552.

peterharperuk commented 9 months ago

are you saying it doesn't fail without LTO

I just tried this and it seemed that the issue was ONLY reproducible with LTO

kilograham commented 3 weeks ago

The alarm code was rewritten for the last SDK, please check on that