hathach / tinyusb

An open source cross-platform USB stack for embedded system
https://www.tinyusb.org
MIT License
4.99k stars 1.05k forks source link

Modified CDC example does not echo back with -O1 and -flto #1472

Open gemarcano opened 2 years ago

gemarcano commented 2 years ago

Operating System

Linux

Board

Custom STM32L052k8 board (pretty much just a STM32L052k8 with headers)

Firmware

Based on examples/device/cdc_msc_freertos code, but modified to expose an HID gamepad instead of a mass storage device.

Code is here: https://gitlab.com/gemarcano/stm32_snes_controller/-/tree/b2c5a2273929153d082b73aa137b25dd956410a0

This uses meson to build, and it relies on FreeRTOS and a separate stm32 hal and start files being available as libraries (https://gitlab.com/gemarcano/stm32_cmsis_headers https://gitlab.com/gemarcano/stm32_freertos https://gitlab.com/gemarcano/stm32_stm32l0xx_hal_driver). Sorry, my build documentation is lacking, as I've very much been using these projects as learning examples for the STM32 chip I have. In summary, I have a custom prefix /home/me/.local/stm32l052k8 where I'm installing all STM32 includes and libraries, and then when cross-compiling executables I point the build system to said prefix.

I attempted to reproduce the issue with the example cdc freertos device (as I've added a custom board, commit can be found at https://github.com/gemarcano/tinyusb), but I was not able to, even by playing with optimization parameters. It appears that whatever is happening is very, very code-dependent (as I'll also show later).

Since this probably also matters, I am building my code using a Gentoo Linux Crossdev generated arm-none-eabi-gcc compiler, with version arm-none-eabi-gcc (Gentoo 11.3.0 p4) 11.3.0

What happened ?

After flashing the modified CDC FreeRTOS example program onto my board, I am able to connect to ttyACM0, but I do not receive any echo from the device. As far as I can tell the software is running, as I can connect via SWD and step through code, but it seems caught in a loop checking for a status that never appears to be set. Specifically:

    for(;;)
    {
        if ( tud_cdc_connected() )
        {
            // There are data available
            uint32_t count = 0;
            while ( tud_cdc_available() )
            {
                uint8_t buf[64];

                // read and echo back
                count = tud_cdc_read(buf, sizeof(buf));

                // Echo back
                // Note: Skip echo by commenting out write() and write_flush()
                // for throughput test e.g
                //    $ dd if=/dev/zero of=/dev/ttyACM0 count=10000
                tud_cdc_write(buf, count);
            }
            tud_cdc_write_flush();
        }
    }

tud_cc_connected() never returns true. As best as I was able to deduce what GCC was doing after taking a look at the dissassebly in Ghidra, it very much looks like GCC is assuming that the variable _cdcd_itf.line_state cannot change from under the function (which is wrong in the case of FreeRTOS, as it is changed in another task as USB events are processed). Here's the relevant disassembly from Ghidra: image

Specifically, notice how the value at _cdc_itf[4] (or line_state) is loaded at address 0x8004410, but then it is never loaded from again in the highlighted loop (LAB_08004416 and the beq to the label), where said value is checked against. Interestingly enough, the suspend flag in _usbd_dev doesn't have this problem, as it's marked volatile, and GCC correctly made sure that its value is always fetched from RAM.

The bug specifically is that the current code in class/cdc/cdc_device.c doesn't really do anything to deal with the possibility of the different functions touching/reading _cdcd_itf being in different FreeRTOS tasks. Specifically in my example, my usb_device_task, which calls tud_task and tud_cdc_write_flush, updates _cdcd_itf.line_state, and my cdc_task then tries to read from it in tud_cdc_connected. As GCC has no way of knowing these functions are in different tasks/threads, it assumes that the data in _cdcd_itf.line_state can't change, so it just caches it in a register.

The real fix for this, I think, would be to through at a minimum _cdcd_itf.line_state if not all of _cdcd_itf behind some kind of mutex. However, for most MCUs, including mine that doesn't have multiple cores or virtual memory, setting line_state to volatile should fix the issue (and it does work for me). volatile won't be enough for devices with multiple cores and caches, though.

How to reproduce ?

  1. Pray that GCC 11.3's optimizer is in a bad mood
  2. Build the modified CDC example with -flto and at least -O1 (-O2 and -O3 also have the problem, -Og does not).
  3. Flash the program to an STM32L052K8
  4. Connect over USB to the CDC interface
  5. Notice that no echo is received (if the GCC optimizer really tried)

3a. Alternatively, look at the generated assembly and identify problems in the cdc_task function as shown in the "What happened?" section.

Debug Log as txt file

No response

Screenshots

No response

hathach commented 2 years ago

this doesn't seem to be tinyusb bug, could be your modified code or setup/integration. Could you confirm if the cdc issue occur with

since that is what I am using to test, and won't use any other setup/toolchain.

gemarcano commented 2 years ago

The bug goes away if all I do is reduce optimization to -00 or -0g, or add -fno-inline-functions-called-once to the compilation flags, which is indicative of undefined behavior. And again, if all I do is make _cdcd_itf[0].line_state volatile, the bug also disappears regardless of optimization level. This makes me believe the problem is very much what I already analyzed, that line_state needs some kind of synchronization between tasks if it's to be used in multiple tasks at once.

With the default Makefile system used by tinyusb, I'm fairly sure the bug is not reproducible as -Os probably prevents the inlining required to trigger this particular bug. I will try again to reproduce using one of the examples and forcing -O3 (I'll also have to add -Wno-null-dereference as apparently a higher optimization level is triggering GCC to warn about a null-derefence at src/device/usbd.c:413:11: error: potential null pointer dereference [-Werror=null-dereference]).

I've also noticed that adding some kinds of functions (e.g. taskYIELD() or vTaskDelay(1)) in the outer-most loop causes GCC to change code generation, causing the bug to not manifest itself (the dissasembly shows GCC reaching out to RAM each loop).

I've already shown the (or a) bug with the code: _cdcd_itf[0].line_state is being accessed from two different tasks, and GCC is within its rights to optimize away/cache non-volatile loads as it has no way of knowing these functions are running as separate tasks/processes (https://developer.arm.com/documentation/dui0472/c/compiler-coding-practices/compiler-optimization-and-the-volatile-keyword). In fact, I've shown this exact optimization being done by GCC in my generated binary in the Ghidra output.

Also, if the problem is a stale line_state, it makes sense that if all I do is comment out the if ( tud_cdc_connected() ) check (I had uncommented it as minicom/linux? appears to set the DTR bit correctly), I am able to get echos back.