raspberrypi / pico-sdk

BSD 3-Clause "New" or "Revised" License
3.24k stars 837 forks source link

Single-core FreeRTOS, ISR stops running on the non-FreeRTOS core #1751

Open slimhazard opened 2 days ago

slimhazard commented 2 days ago

This problem was discussed on the SDK forum here, where @kilograham thought it sounds like a bug.

The issue arises with single-core FreeRTOS (configNUMBER_OF_CORES=1), starting the task scheduler on core0, and an ISR running on core1. The same code was also tested with threadsafe background mode, poll mode, and SMP FreeRTOS, and the problem doesn't happen in any of those versions.

On core1, the ADC is set to free-running mode, and this ISR is initiated to trigger on ADC_IRQ_FIFO. It saves the value from adc_fifo_get() in a static variable:

static volatile uint16_t temp_adc_raw;
static volatile unsigned long isr_runs = 0;
static volatile absolute_time_t last_isr_t;

static void
__time_critical_func(temp_isr)(void)
{
    uint16_t val = adc_fifo_get();
    critical_section_enter_blocking(&temp_critsec);
    temp_adc_raw = val;
    critical_section_exit(&temp_critsec);
    isr_runs++;
    last_isr_t = get_absolute_time();
}

isr_runs and last_isr_t are debugging variables, to count the number of ISR runs and the time of the last run. The critical section is initialized in main().

On core0, the FreeRTOS task scheduler runs a task that gets a WiFi connection and initializes lwIP callbacks. The lwIP calls are then managed by the task scheduler. The app uses this function to retrieve the latest ADC value, called from the lwIP callbacks on core0:

uint32_t
get_temp(void)
{
    uint16_t raw;

    critical_section_enter_blocking(&temp_critsec);
    raw = temp_adc_raw;
    critical_section_exit(&temp_critsec);
    printf("isr_runs=%u last_isr_t=%llu\n", isr_runs,
           to_us_since_boot(last_isr_t));

    /* ... do some computation and return a value ... */
}

The printf output shows how often the ISR ran, and when it ran last in µs since boot.

In the problematic version of the app, when a network client makes a request and the lwIP callback calls get_temp(), the output shows that the ISR ran a few thousand times, then stopped:

isr_runs=12133 last_isr_t=5549347

These values never increase. The effect is that the network client gets back the value that was saved on the last ISR run, rather than continuously updated values derived from the ADC.

When I run the app in the other modes (threadsafe background, poll, and SMP FreeRTOS), the debug outputs always increase, and the network client gets the most recent value from the ADC.

This has been tested with SDK 1.5.1 and both of FreeRTOS V11.1.0 (current latest) and V10.6.2, which was the last version before the smp branch was merged to main. The same thing happened with both FreeRTOS versions.

In the same app, I have an async_context at-time worker that is also started on core1. It does the intercore synchronization in the same way as with the ISR -- the recurring worker function saves a static value, and the lwIP calls from core0 retrieve it, protected by a critical section. The network client always gets an updated value, including in the single-core FreeRTOS version. So although the ISR has stopped, the at-time worker continues running.

There is evidence that the ISR may have stopped around the time that the WiFi connection reaches the linkup state (which is done by the initializer task run by FreeRTOS on core0). I'm also saving a timestamp right after cyw43_tcpip_link_status() returns CYW43_LINK_UP, whose us_since_boot value is shown here as t:

t=5549470
[...]
isr_runs=12133 last_isr_t=5549347

The difference between the LINK_UP timestamp and the last ISR timestamp was 123µs in this case. The deltas I've seen here are usually in the 100s of µs range, always <= 1ms.

So to summarize:

slimhazard commented 2 days ago

For background, this is how the ADC and ISR are initiated on core1 -- nothing surprising, I think:

    adc_init();
    adc_set_temp_sensor_enabled(true);
    adc_select_input(ADC_TEMP_CH);
    /* Write to FIFO length 1, and retain the ERR bit. */
    adc_fifo_setup(true, false, 1, true, false);
    adc_set_clkdiv(ADC_MAX_CLKDIV);

    irq_set_exclusive_handler(ADC_IRQ_FIFO, temp_isr);
    adc_irq_set_enabled(true);
    irq_set_enabled(ADC_IRQ_FIFO, true);
    adc_run(true);

ADC_MAX_CLKDIV is set to the largest possible divider for the ADC (65535.f + 255.f/256.f) -- run as slowly as possible.

In case you're wondering what it's all for, I'm developing the FreeRTOS apps for this project:

https://gitlab.com/slimhazard/picow-http-example

Which is just sample applications to demonstrate the HTTP library in this project:

https://gitlab.com/slimhazard/picow_http

So the lwIP callbacks implement the http server. The idea being that the server can be used with any choice of a cyw43_arch, and the sample apps show you how. Up to now, I've been saying that the HTTP lib doesn't support FreeRTOS, but that limitation can be removed now.

The problem may be a bug of my own, of course, but I suspect that it has less to do with the specifics of the HTTP library, and more about how to initiate WiFi and lwIP with FreeRTOS on one core, and run an ISR without FreeRTOS on the other core.

slimhazard commented 2 days ago

I've found a workaround, or a fix, depending on how you look at it.

I've added a semaphore that ADC initialization (as shown in the last comment) now waits for before doing anything else:

    sem_acquire_blocking(up);
    adc_init();
    /* ... and so forth as above ... */

The semaphore is signaled from the core0 initialization task after the network reaches the linkup state. So the ADC/IRQ initialization does not proceed until after that happens.

With that, everything works as expected. The debugging output shows that the number of ISR runs and the time of the last run increase on every get_temp() call, and the network client gets up-to-date values that originated from the ADC.

So the solution appears to be: don't start an ISR on a non-FreeRTOS core until the network is brought up by a task running on the FreeRTOS core. That is evidently only necessary in the single-core FreeRTOS configuration.

(I suspect that the ADC init code can run before waiting, and only the ISR enable has to wait for the semaphore.)

I'd say this is either a workaround for a bug in FreeRTOS or in the RP2040 port, or a fix for a bug in my own code, depending on whether this is considered works as designed for FreeRTOS and/or the port.

I personally would say that it's worth looking into why this happens in the port, since the wait is only necessary in the one specific configuration. I wouldn't have expected that an ISR on one core gets stopped unless it waits until after the link is brought up on the other core.

kilograham commented 2 days ago

It is surprising that this works on SMP and not otherwise; the most common issue is with IRQ setup on SMP if you don't pin the setup task, as it may migrate from one core to the other at inconvenient times when not pinned.

Can you attach a ZIP for repro?

slimhazard commented 2 days ago

The workaround doesn't work as well as I thought. The code sample from my previous comment left out debugging output that I believed to be irrelevant:

    sem_acquire_blocking(up);
    puts("Doesn't work without this output");

But as the message says, if I leave out the puts() line, the fix doesn't work. I tried a few things to fix that, but can't explain it.

@kilograham I pushed a branch just for this issue, can you use that to try to reproduce?

https://gitlab.com/slimhazard/picow-http-example/-/tree/issue-1751

If so, there are some build requirements:

https://gitlab.com/slimhazard/picow_http/-/wikis/required-software

At minimum you need gperf and the four Python classes. I don't think that's too much of a PITA, but I guess I'm biased.

I think you'll be all right with the Quick Start:

https://gitlab.com/slimhazard/picow-http-example/-/tree/issue-1751#quick-start

cmake -DWIFI_SSID and -DWIFI_PASSWORD are just like in pico-examples. You only need -DHOSTNAME to test the TLS examples, and I wouldn't bother, that has nothing to do with the issue. -DCMAKE_BUILD_TYPE=Debug helps with some extra debug output.

The four relevant test apps are:

When you deploy one of those and if all goes well, you first see progress getting the wifi connection in the console, and eventually "http started". Then you can point a browser to http://picow (or whatever host name you chose), and you should see something like this:

https://gitlab.com/slimhazard/picow-http-example/-/wikis/Screenshots

The temperature value gets updated by Javascript every few seconds, or you can invoke it directly with curl http://picow/temp. The issue here was that the value should change occasionally; when the ISR stopped, it never changed. As you know, the on-die sensor doesn't change especially rapidly, even when all is working well. If I want it change faster, I put my finger on the RP2040 or blow a fan on it to make it warmer or cooler.

Let me know if you have any questions.