Closed rgrr closed 3 months ago
@rgrr If you would be willing to share code I can help with fixing this problem. I use TinyUSB with mynewt and I'm very much interested in fixing this before it affects me in the future.
thank you @kasjer. Problem with the code is, that it is owned by my company. I have to strip everything down to a minimum test case. Will again need some time.
My gut feeling is currently telling me, that it has something to do with disconnect/connect and pending data transfers.
Currently catching the debug output, but if there is bad luck than the problem does not happen with debug log on.
@rgrr I guess it's still easier for you to strip down code that you know that finds problem then for me to come up with the code that may never trigger bad condition. As soon as you have something to share let me know and we (you and me) can get to the bottom of this.
thanks again @kasjer ... I agree and hope to find a simple example.
Small spoiler: disconnect/connect sequence must be the culprit, because EP=0 as the trace above showed, EP=2 is used for data transfer
Haha... and I already wanted to check the "friendly" "patch".
Now I have a debug log. One which is ok, the other run ning into the bug. They are more or less the same until line 113 where the buggy log queues another request while the "ok" log completes a transfer.
See yourself if you want.
The "ZZZZ" messages are here
bool dcd_edpt_xfer(uint8_t rhport, uint8_t ep_addr, uint8_t* buffer, uint16_t total_bytes) {
....
if (control_status) {
kernelprintf("ZZZZ1 %d\n", is_in_isr());
// Status Phase also requires EasyDMA has to be available as well !!!!
edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
// The nRF doesn't interrupt on status transmit so we queue up a success response.
dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());
kernelprintf("ZZZZ2 %d\n", is_in_isr());
I hoped to see something about interrupts, but did not
@rgrr I'm processing your logs to understand the issue, if you can share some code it could help (especially host part). If you have traces from USB or logic analyzers it could also help understand timing. My wild guess right now is DMA access which must be serialized and can affect sequence of events when DMA is not ready at some point.
@kasjer : sorry, both sides are more or less complex. The host side requires a working client, so no easy way.
I have reintroduced the disable/enable interrupt around the above sequence and again the problem disappeared.
But my first interpretation was wrong: the function is not called when it is already active (never observed active interrupt in this function til now). The actual problem is, that an interrupt between the two calls edpt_dma_start() and dcd_event_xfer_complete() inserts something it should not do.
So I'm currently trying to find out, what it could be. And my first try was to exchange the two calls.
Hmmmm... wondering, if dcd_event_handler() has to be thread safe...
Exchanging the two call seems to solve my issue. Currently have 3000 iterations of my test loop. With the original code the bug happens latest after 300 iterations.
@kasjer : Update: it ran all night, had around 20000 CDC disconnect/reconnect cycles (the "bad" original needed around 300+/- cycles to run into the assertion). I count this as verified.
Do you have any explanation?
If you agree, I will prepare a pull request.
I see what you mean. I don't see any drawback of moving dcd_event_xfer_complete()
to be executed before edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
I'm still trying to figure out how original order breaks the flow.
@rgrr could you please modify line
TU_LOG_USBD(" Queue EP %02X with %u bytes ...\r\n", ep_addr, total_bytes);
to include information returned by is_in_isr()
as you did with ZZZZ ?
... and maybe dump content of _usbd_qdef
and _usbd_qdef_buf
after assertion failed to see if queue get corrupted somehow
Ok... done that with surprising results:
TU_LOG_USBD(" Queue EP %02X with %u bytes, is_in_isr %d ...\r\n", ep_addr, total_bytes, xx_is_in_isr());
never shows "in isr"!
Also the queue show some obscure values:
wr_idx/rd_idx are beyond depth!?
PS: OSAL is NONE.
I was expecting some usbd_defer_func()
call due to DMA access race but queue does not seem to have any.
Now I'm inclined to thing that atomic_flag_test_and_set()
is not working as expected.
It would be nice to check if code ever tries to defer DMA.
Maybe code generated for atomic_flag_test_and_set()
in xact_out_dma()
is somehow broken
Dump of assembler code for function xact_out_dma:
=> 0x00001968 <+0>: push {r4, r5, r6, r7}
0x0000196a <+2>: ldr r1, [pc, #172] @ (0x1a18 <xact_out_dma+176>)
0x0000196c <+4>: dmb ish
0x00001970 <+8>: add.w r2, r1, #288 @ 0x120
0x00001974 <+12>: mov.w r4, #1 // Block that checks if DMA is in use
0x00001978 <+16>: ldrexb r3, [r2] //
0x0000197c <+20>: strexb r5, r4, [r2] //
0x00001980 <+24>: cmp r5, #0 //
0x00001982 <+26>: bne.n 0x1978 <xact_out_dma+16>
0x00001984 <+28>: dmb ish
0x00001988 <+32>: uxtb r3, r3
0x0000198a <+34>: cbz r3, 0x19a4 <xact_out_dma+60>
0x0000198c <+36>: ldr r3, [pc, #140] @ (0x1a1c <xact_out_dma+180>)
0x0000198e <+38>: ldr r2, [r3, #4]
0x00001990 <+40>: ubfx r2, r2, #0, #9
0x00001994 <+44>: subs r2, #0
0x00001996 <+46>: mov r1, r0
0x00001998 <+48>: pop {r4, r5, r6, r7}
0x0000199a <+50>: ldr r0, [pc, #132] @ (0x1a20 <xact_out_dma+184>)
0x0000199c <+52>: it ne
0x0000199e <+54>: movne r2, #1
0x000019a0 <+56>: b.w 0x1880 <usbd_defer_func>
0x000019a4 <+60>: cmp r0, #8
0x000019a6 <+62>: ldr r3, [pc, #124] @ (0x1a24 <xact_out_dma+188>)
0x000019a8 <+64>: bne.n 0x19e0 <xact_out_dma+120>
0x000019aa <+66>: ldr.w r2, [r3, #1216] @ 0x4c0
0x000019ae <+70>: lsls r0, r2, #15
@rgrr Can you verify that your function have similar pattern for atomic flags ?
wr_idx/rd_idx are beyond depth!?
For the fifo it's normal, it's using unmasked pointers to avoid 1 space wasted and decople write /read.
@kasjer : the atomic flag functions used are
#define atomic_bool nrfx_atomic_flag_t
#define atomic_flag nrfx_atomic_flag_t
#define atomic_flag_clear(X) nrfx_atomic_flag_clear(X)
#define atomic_flag_test_and_set(X) nrfx_atomic_flag_set_fetch(X)
Code is as follows (sorry, I'm no Cortex-M assembler freak):
xact_out_dma:
00031be4: push {r7, lr}
00031be6: mov r7, sp
00031be8: sub sp, #16
00031bea: strb.w r0, [r7, #-5]
193 xfer_td_t* xfer = get_td(epnum, TUSB_DIR_OUT);
00031bee: ldrb.w r0, [r7, #-5]
00031bf2: movs r1, #0
00031bf4: bl 0x31b72 <get_td>
00031bf8: str r0, [sp, #4]
198 if (atomic_flag_test_and_set(&_dcd.dma_running)) {
00031bfa: movw r0, #13428 @ 0x3474
00031bfe: movt r0, #8192 @ 0x2000
00031c02: add.w r0, r0, #288 @ 0x120
00031c06: bl 0x374fc <nrfx_atomic_flag_set_fetch>
00031c0a: cbz r0, 0x31c34 <xact_out_dma+80>
00031c0c: b.n 0x31c0e <xact_out_dma+42>
199 usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c0e: ldrb.w r1, [r7, #-5]
00031c12: movw r0, #60676 @ 0xed04
00031c16: movt r0, #57344 @ 0xe000
149 return (SCB->ICSR & SCB_ICSR_VECTACTIVE_Msk) ? true : false;
00031c1a: ldr r2, [r0, #0]
00031c1c: bfc r2, #9, #23
00031c20: cmp r2, #0
00031c22: it ne
00031c24: movne r2, #1
199 usbd_defer_func((osal_task_func_t) xact_out_dma_wrapper, (void*) (uint32_t) epnum, is_in_isr());
00031c26: movw r0, #7625 @ 0x1dc9
00031c2a: movt r0, #3
00031c2e: bl 0x368a4 <usbd_defer_func>
200 return;
I have inserted my ZZZZ debug output again, this time including _dcd.dma_running.
@rgrr could you please show assembler for nrfx_atomic_flag_set_fetch()
and nrfx_atomic_u32_fetch_or()
this later function can be compiled in different variants so it's just to check which way for atomic manipulation is used in your code.
One version can disables interrupts while other may use ldrex instructions
@kasjer here they are:
nrfx_atomic_u32_fetch_or:
000374a8: sub sp, #20
000374aa: str r0, [sp, #16]
000374ac: str r1, [sp, #12]
99 NRFX_ATOMIC_OP(orr, old_val, new_val, p_data, value);
000374ae: ldr r0, [sp, #16]
000374b0: ldr r1, [sp, #12]
000374b2: ldrex r2, [r0]
000374b6: orr.w r3, r2, r1
000374ba: strex r12, r3, [r0]
000374be: teq r12, #0
000374c2: bne.n 0x374b2 <nrfx_atomic_u32_fetch_or+10>
000374c4: str r2, [sp, #8]
000374c6: str r3, [sp, #4]
000374c8: str.w r12, [sp]
101 return old_val;
000374cc: ldr r0, [sp, #8]
000374ce: add sp, #20
000374d0: bx lr
...
nrfx_atomic_flag_set_fetch:
000374fc: push {r7, lr}
000374fe: mov r7, sp
00037500: sub sp, #8
00037502: str r0, [sp, #4]
379 return nrfx_atomic_u32_fetch_or(p_data, 1);
00037504: ldr r0, [sp, #4]
00037506: movs r1, #1
00037508: bl 0x374a8 <nrfx_atomic_u32_fetch_or>
0003750c: add sp, #8
0003750e: pop {r7, pc}
In the debug output at ZZZZ, _dcd.dma_running is always false.
Now compiled with "NRFX_ATOMIC_USE_BUILT_IN=1" and checking.
This looks OK, so it must be something else.
I think I understand the problem. Your solution to swap lines is correct.
Here is what happens:
So changing order of lines seems the right thing to do. @rgrr Thanks you for finding it and solution.
Now that I know how it works it's easy to reproduce not even data endpoints traffic is needed. Adding TU_LOG() between those lines will suffice to trigger the problem.
if (control_status) {
// Status Phase also requires EasyDMA has to be available as well !!!!
edpt_dma_start(&NRF_USBD->TASKS_EP0STATUS);
TU_LOG2("Not so fast\n");
// The nRF doesn't interrupt on status transmit so we queue up a success response.
dcd_event_xfer_complete(0, ep_addr, 0, XFER_RESULT_SUCCESS, is_in_isr());
Testing app
import serial
tty="COM30"
SERIAL_BAUDRATE=1000000
for i in range(0, 20000) :
ser = serial.Serial(port=tty,
baudrate=SERIAL_BAUDRATE, timeout=0.01)
ser.close()
Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.
Should I prepare a PR?
Haha, that's a simple testcase! And a very complicated sequence for the bug to happen.
Should I prepare a PR?
Please do.
thank you @rgrr and @kasjer for very effective troubleshooting analysist
Operating System
Others
Board
PCA10056
Firmware
Custom firmware which uses CDC-ACM for data transfer. On the other end BLE with old Nordic SDK.
TinyUSB is at the state as of 2024-08-26.
What happened ?
The test procedure does "connect CDC, get BLE device list, connect BLE, transfer little data, disconnect device, disconnect CDC". Not sure if the whole procedure is required, the "disconnect CDC, connect CDC" part seems to be the critical part.
After some iterations (may take a few hundred iterations, each around 15s), TinyUSB has a failed assertion. I nailed that down before already, see #2626, that disable/enable IRQ at the correct place solves the problem. But the reviewer (@kasjer) wasn't happy with the solution, because interrupts were blocked for a long time.
Now there is a little bit more time on my side and I will try to fix this differently.
How to reproduce ?
Test firmware/loop see above
Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)
I will try to produce a log, currently I can deliver some kind of system state via the debugger.
Failed assertion is in usbd.c:1322, "TU_ASSERT(_usbd_dev.ep_status[epnum][dir].busy == 0)"
If one follows the stack (attached), one can see that
My personal conclusion is, that the static _ctrl_xfer is in an unexpected state, just as if the order of events is important and disrupted by an interrupt which puts its own event in between.
Screenshots
Stack of the assertion:
I have checked existing issues, dicussion and documentation