dcd_nrf5x: race condition again

rgrr commented 3 months ago

Operating System

Others

Board

PCA10056

Firmware

Custom firmware which uses CDC-ACM for data transfer. On the other end BLE with old Nordic SDK.

TinyUSB is at the state as of 2024-08-26.

What happened ?

The test procedure does "connect CDC, get BLE device list, connect BLE, transfer little data, disconnect device, disconnect CDC". Not sure if the whole procedure is required, the "disconnect CDC, connect CDC" part seems to be the critical part.

After some iterations (may take a few hundred iterations, each around 15s), TinyUSB has a failed assertion. I nailed that down before already, see #2626, that disable/enable IRQ at the correct place solves the problem. But the reviewer (@kasjer) wasn't happy with the solution, because interrupts were blocked for a long time.

Now there is a little bit more time on my side and I will try to fix this differently.

How to reproduce ?

Test firmware/loop see above

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

I will try to produce a log, currently I can deliver some kind of system state via the debugger.

Failed assertion is in usbd.c:1322, "TU_ASSERT(_usbd_dev.ep_status[epnum][dir].busy == 0)"

If one follows the stack (attached), one can see that

tud_task_ext(): handles DCD_EVENT_XFER_COMPLETE, ep_addr==0x80, which means, that epnum==0 and ep_dir==1
usbd_control_xfer_cb() runs through until call to _status_stage_xact() (line 208)
_status_stage_xact() now wants to initiate usbd_edpt_xfer() with ep_addr==0
usbd_edpt_xfer() fails because this endpoint is still busy

My personal conclusion is, that the static _ctrl_xfer is in an unexpected state, just as if the order of events is important and disrupted by an interrupt which puts its own event in between.

Screenshots

Stack of the assertion:

I have checked existing issues, dicussion and documentation

[X] I confirm I have checked existing issues, dicussion and documentation.

kasjer commented 3 months ago

@rgrr If you would be willing to share code I can help with fixing this problem. I use TinyUSB with mynewt and I'm very much interested in fixing this before it affects me in the future.

rgrr commented 3 months ago

thank you @kasjer. Problem with the code is, that it is owned by my company. I have to strip everything down to a minimum test case. Will again need some time.

My gut feeling is currently telling me, that it has something to do with disconnect/connect and pending data transfers.

Currently catching the debug output, but if there is bad luck than the problem does not happen with debug log on.

kasjer commented 3 months ago

@rgrr I guess it's still easier for you to strip down code that you know that finds problem then for me to come up with the code that may never trigger bad condition. As soon as you have something to share let me know and we (you and me) can get to the bottom of this.

rgrr commented 3 months ago

thanks again @kasjer ... I agree and hope to find a simple example.

Small spoiler: disconnect/connect sequence must be the culprit, because EP=0 as the trace above showed, EP=2 is used for data transfer

rgrr commented 3 months ago

Haha... and I already wanted to check the "friendly" "patch".

Now I have a debug log. One which is ok, the other run ning into the bug. They are more or less the same until line 113 where the buggy log queues another request while the "ok" log completes a transfer.

See yourself if you want.

usbbug.txt usbok.txt

The "ZZZZ" messages are here

bool dcd_edpt_xfer(uint8_t rhport, uint8_t ep_addr, uint8_t* buffer, uint16_t total_bytes) {
....
  if (control_status) {
    kernelprintf("ZZZZ1 %d\n", is_in_isr());

    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

    kernelprintf("ZZZZ2 %d\n", is_in_isr());

I hoped to see something about interrupts, but did not

kasjer commented 3 months ago

@rgrr I'm processing your logs to understand the issue, if you can share some code it could help (especially host part). If you have traces from USB or logic analyzers it could also help understand timing. My wild guess right now is DMA access which must be serialized and can affect sequence of events when DMA is not ready at some point.

rgrr commented 3 months ago

@kasjer : sorry, both sides are more or less complex. The host side requires a working client, so no easy way.

I have reintroduced the disable/enable interrupt around the above sequence and again the problem disappeared.

But my first interpretation was wrong: the function is not called when it is already active (never observed active interrupt in this function til now). The actual problem is, that an interrupt between the two calls edpt_dma_start() and dcd_event_xfer_complete() inserts something it should not do.

So I'm currently trying to find out, what it could be. And my first try was to exchange the two calls.

rgrr commented 3 months ago

Hmmmm... wondering, if dcd_event_handler() has to be thread safe...

rgrr commented 3 months ago

Exchanging the two call seems to solve my issue. Currently have 3000 iterations of my test loop. With the original code the bug happens latest after 300 iterations.

rgrr commented 3 months ago

@kasjer : Update: it ran all night, had around 20000 CDC disconnect/reconnect cycles (the "bad" original needed around 300+/- cycles to run into the assertion). I count this as verified.

Do you have any explanation?

If you agree, I will prepare a pull request.

kasjer commented 3 months ago

I see what you mean. I don't see any drawback of moving dcd_event_xfer_complete() to be executed before edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS); I'm still trying to figure out how original order breaks the flow.

kasjer commented 3 months ago

@rgrr could you please modify line

TU_LOG_USBD("  Queue EP %02X with %u bytes ...\r\n", ep_addr, total_bytes);

to include information returned by is_in_isr() as you did with ZZZZ ?

kasjer commented 3 months ago

... and maybe dump content of _usbd_qdef and _usbd_qdef_buf after assertion failed to see if queue get corrupted somehow

rgrr commented 3 months ago

Ok... done that with surprising results:

  TU_LOG_USBD("  Queue EP %02X with %u bytes, is_in_isr %d ...\r\n", ep_addr, total_bytes, xx_is_in_isr());

never shows "in isr"!

Also the queue show some obscure values:

wr_idx/rd_idx are beyond depth!?

PS: OSAL is NONE.

kasjer commented 3 months ago

I was expecting some usbd_defer_func() call due to DMA access race but queue does not seem to have any. Now I'm inclined to thing that atomic_flag_test_and_set() is not working as expected. It would be nice to check if code ever tries to defer DMA. Maybe code generated for atomic_flag_test_and_set() in xact_out_dma() is somehow broken Dump of assembler code for function xact_out_dma:

=> 0x00001968 <+0>: push    {r4, r5, r6, r7}
   0x0000196a <+2>: ldr r1, [pc, #172]  @ (0x1a18 <xact_out_dma+176>)
   0x0000196c <+4>: dmb ish
   0x00001970 <+8>: add.w   r2, r1, #288    @ 0x120
   0x00001974 <+12>:    mov.w   r4, #1              // Block that checks if DMA is in use
   0x00001978 <+16>:    ldrexb  r3, [r2]            //
   0x0000197c <+20>:    strexb  r5, r4, [r2]        //
   0x00001980 <+24>:    cmp r5, #0                  //
   0x00001982 <+26>:    bne.n   0x1978 <xact_out_dma+16>
   0x00001984 <+28>:    dmb ish
   0x00001988 <+32>:    uxtb    r3, r3
   0x0000198a <+34>:    cbz r3, 0x19a4 <xact_out_dma+60>
   0x0000198c <+36>:    ldr r3, [pc, #140]  @ (0x1a1c <xact_out_dma+180>)
   0x0000198e <+38>:    ldr r2, [r3, #4]
   0x00001990 <+40>:    ubfx    r2, r2, #0, #9
   0x00001994 <+44>:    subs    r2, #0
   0x00001996 <+46>:    mov r1, r0
   0x00001998 <+48>:    pop {r4, r5, r6, r7}
   0x0000199a <+50>:    ldr r0, [pc, #132]  @ (0x1a20 <xact_out_dma+184>)
   0x0000199c <+52>:    it  ne
   0x0000199e <+54>:    movne   r2, #1
   0x000019a0 <+56>:    b.w 0x1880 <usbd_defer_func>
   0x000019a4 <+60>:    cmp r0, #8
   0x000019a6 <+62>:    ldr r3, [pc, #124]  @ (0x1a24 <xact_out_dma+188>)
   0x000019a8 <+64>:    bne.n   0x19e0 <xact_out_dma+120>
   0x000019aa <+66>:    ldr.w   r2, [r3, #1216] @ 0x4c0
   0x000019ae <+70>:    lsls    r0, r2, #15

@rgrr Can you verify that your function have similar pattern for atomic flags ?

HiFiPhile commented 3 months ago

wr_idx/rd_idx are beyond depth!?

For the fifo it's normal, it's using unmasked pointers to avoid 1 space wasted and decople write /read.

rgrr commented 3 months ago

@kasjer : the atomic flag functions used are

#define atomic_bool                    nrfx_atomic_flag_t
#define atomic_flag                    nrfx_atomic_flag_t

#define atomic_flag_clear(X)           nrfx_atomic_flag_clear(X)
#define atomic_flag_test_and_set(X)    nrfx_atomic_flag_set_fetch(X)

Code is as follows (sorry, I'm no Cortex-M assembler freak):

          xact_out_dma:
00031be4:   push    {r7, lr}
00031be6:   mov     r7, sp
00031be8:   sub     sp, #16
00031bea:   strb.w  r0, [r7, #-5]
 193        xfer_td_t* xfer = get_td(epnum, TUSB_DIR_OUT);
00031bee:   ldrb.w  r0, [r7, #-5]
00031bf2:   movs    r1, #0
00031bf4:   bl      0x31b72 <get_td>
00031bf8:   str     r0, [sp, #4]
 198        if (atomic_flag_test_and_set(&_dcd.dma_running)) {
00031bfa:   movw    r0, #13428      @ 0x3474
00031bfe:   movt    r0, #8192       @ 0x2000
00031c02:   add.w   r0, r0, #288    @ 0x120
00031c06:   bl      0x374fc <nrfx_atomic_flag_set_fetch>
00031c0a:   cbz     r0, 0x31c34 <xact_out_dma+80>
00031c0c:   b.n     0x31c0e <xact_out_dma+42>
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c0e:   ldrb.w  r1, [r7, #-5]
00031c12:   movw    r0, #60676      @ 0xed04
00031c16:   movt    r0, #57344      @ 0xe000
 149        return (SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) ? true : false;
00031c1a:   ldr     r2, [r0, #0]
00031c1c:   bfc     r2, #9, #23
00031c20:   cmp     r2, #0
00031c22:   it      ne
00031c24:   movne   r2, #1
 199          usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c26:   movw    r0, #7625       @ 0x1dc9
00031c2a:   movt    r0, #3
00031c2e:   bl      0x368a4 <usbd_defer_func>
 200          return;

I have inserted my ZZZZ debug output again, this time including _dcd.dma_running.

kasjer commented 3 months ago

@rgrr could you please show assembler for nrfx_atomic_flag_set_fetch() and nrfx_atomic_u32_fetch_or() this later function can be compiled in different variants so it's just to check which way for atomic manipulation is used in your code. One version can disables interrupts while other may use ldrex instructions

rgrr commented 3 months ago

@kasjer here they are:

          nrfx_atomic_u32_fetch_or:
000374a8:   sub     sp, #20
000374aa:   str     r0, [sp, #16]
000374ac:   str     r1, [sp, #12]
 99           NRFX_ATOMIC_OP(orr, old_val, new_val, p_data, value);
000374ae:   ldr     r0, [sp, #16]
000374b0:   ldr     r1, [sp, #12]
000374b2:   ldrex   r2, [r0]
000374b6:   orr.w   r3, r2, r1
000374ba:   strex   r12, r3, [r0]
000374be:   teq     r12, #0
000374c2:   bne.n   0x374b2 <nrfx_atomic_u32_fetch_or+10>
000374c4:   str     r2, [sp, #8]
000374c6:   str     r3, [sp, #4]
000374c8:   str.w   r12, [sp]
101           return old_val;
000374cc:   ldr     r0, [sp, #8]
000374ce:   add     sp, #20
000374d0:   bx      lr

...

          nrfx_atomic_flag_set_fetch:
000374fc:   push    {r7, lr}
000374fe:   mov     r7, sp
00037500:   sub     sp, #8
00037502:   str     r0, [sp, #4]
379           return nrfx_atomic_u32_fetch_or(p_data, 1);
00037504:   ldr     r0, [sp, #4]
00037506:   movs    r1, #1
00037508:   bl      0x374a8 <nrfx_atomic_u32_fetch_or>
0003750c:   add     sp, #8
0003750e:   pop     {r7, pc}

In the debug output at ZZZZ, _dcd.dma_running is always false.

Now compiled with "NRFX_ATOMIC_USE_BUILT_IN=1" and checking.

kasjer commented 3 months ago

This looks OK, so it must be something else.

kasjer commented 3 months ago

I think I understand the problem. Your solution to swap lines is correct.

Here is what happens:

Set Line Coding request is received
Task schedules read of data (7 bytes)
Task also schedules write to EP 82 (14 bytes)
Interrupt send notification to Task that 7 bytes transfer is done
Task wants to send status (0 bytes that does not takes DMA for long)
Task sends 0 bytes data packet
Interrupt fires probably for EP 82 informing that packet was received by host and send transfer complete event
Interrupt fires again this time is new setup request (host received 0 bytes packet after all) and sends setup request event
Task continues and sends to itself transfer complete event for Set Line Coding
Task process queue starting from EP 82 transfer complete
Task process out of order setup request Get Line Coding
Task now sees old transfer complete for Set Line Coding but probably confuses it with last request eventually

So changing order of lines seems the right thing to do. @rgrr Thanks you for finding it and solution.

kasjer commented 3 months ago

Now that I know how it works it's easy to reproduce not even data endpoints traffic is needed. Adding TU_LOG() between those lines will suffice to trigger the problem.

  if (control_status) {
    // Status Phase also requires EasyDMA has to be available as well !!!!
    edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);

    TU_LOG2("Not so fast\n");

    // The nRF doesn't interrupt on status transmit so we queue up a success response.
    dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());

Testing app

import serial

tty="COM30"
SERIAL_BAUDRATE=1000000

for i in range(0, 20000) :
    ser = serial.Serial(port=tty,
                        baudrate=SERIAL_BAUDRATE, timeout=0.01)
    ser.close()

rgrr commented 3 months ago

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

kasjer commented 3 months ago

Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.

Should I prepare a PR?

Please do.

hathach commented 2 months ago

thank you @rgrr and @kasjer for very effective troubleshooting analysist

hathach / tinyusb