raspberrypi / pico-feedback

25 stars 2 forks source link

RP2040 USB Host does not correctly re-synchronize after Bulk IN DATA0/DATA1 PID error (see USB Specification 8.6.4) #394

Open rppicomidi opened 3 months ago

rppicomidi commented 3 months ago

I am using the Raspberry Pi Pico board's native USB Host hardware to implement a USB MIDI Host driver for TinyUSB in the usb_midi_host project. The driver uses the RP2040's USB Host hardware to implement one USB Bulk IN endpoint and one USB Bulk Out endpoint. A library user filed this issue. USB sniffer traces seem to indicate that the RP2040 USB Host hardware does not behave properly when it receives a packet with a DATA1 PID when the Host expects the packet to have a DATA0 PID. As I read USB 2.0 specification 8.6.4, the Host should send ACK to the USB Device and ignore the data in the packet. What the RP2040 USB Host seems to do is to send no handshake at all and ignore the data. This causes the USB Device to forever re-transmit the packet with the same unexpected DATA1 PID forever.

There does not seem to be any special configuration bit documented that would activate or deactivate proper USB Host behavior, but if it is a configuration error in software, that would be good to know. This behavior is not documented in the RP2040 Errata. Is this a known issue?

Incidentally, the Pico-PIO-USB USB Host mode does not have this problem; USB traces show the devices behaves correctly with my driver code.

P33M commented 3 months ago

In this case I would expect USB_INTE_ERROR_DATA_SEQ_BITS to be set, followed by a panic("Data Seq Error \n") - which is admittedly a bad way to handle it, but not an infinite retry.

Edit: pieced together it's a Bulk IN EP. This case is explicitly called out in the host hardware FSM - parsing it, I see that the PID is validated against the current buffer descriptor word, and if invalid then the hardware skips the remainder of the data packet and issues an ACK.

So that's a mystery.

rppicomidi commented 3 months ago

@P33M The code I am using configures two RP2040 Endpoint Control registers, for example, EP1 and EP2, for Bulk transfer instead of Interrupt transfer. Although the documentation states that the Endpoint 1-15 In Control regsiters are for Interrupt endpoint control, Bulk endpoints 1-15 are currently supported in the TinyUSB RP2040 HCD code by setting bits 27:26 to 2 (Bulk). Is the host mode FSM the same for EPx and as it is for the the other endpoints?

The documentation says SIE_STATUS is for EPx; there is no documented SIE_STATUS register for the other host endpoints as near as I can tell.

rppicomidi commented 3 months ago

This issue may be related to issue #231

rppicomidi commented 2 months ago

I have implemented a straightforward test procedure for verifying that this is a real bug. I hope this is useful to you. Hopefully, this issue is not in the new RP2350 chip.

This test requires a Raspberry Pi Pico board running usb_midi_host_example, a USB to UART adapter to give the Pico board a console display (I use a picoprobe), a USB Analyzer (I used a Pico board running usb-sniffer-lite), and a STM32F072B-Discovery board running TinyUSB midi_test example. I did the experiment in two parts:

  1. Install a USB MIDI Host program on a Pico board and verify it works normally when you plug in a custom USB MIDI device that is operating normally.
  2. Modify the USB MIDI device so that the first Bulk IN transfer on IN Endpoint 1 starts with a DATA1 PID instead of DATA0 PID. Use a sniffer to observe no ACK from the RP2040 host and an infinite loop of retries from the device. You will see something like this:
    1001 : SOF #246
     3 : IN: 0x01/1
     6 : DATA1 (4): 09 90 51 7f
    12 : IN: 0x01/1
    15 : DATA1 (4): 09 90 51 7f
    21 : IN: 0x01/1
    24 : DATA1 (4): 09 90 51 7f
    31 : IN: 0x01/1
    34 : DATA1 (4): 09 90 51 7f
    40 : IN: 0x01/1
    43 : DATA1 (4): 09 90 51 7f
    49 : IN: 0x01/1

Details of the experiment follow if you choose to do it. Part 1 Build the usb_midi_host_example code and load it onto the Pico board. Plug a Micro USB to USB A Adapter to the Pico board. Attach a serial port to USB adapter to the Pico board pins 1 and 2. Start a console program so you can observe the output from the usb_midi_host_example code. Connect a 5V power supply to the Pico board's VBus pin and any ground pin. Verify the usb_midi_host_example code is running (will display a message to the console).

Use git to install an instance of TinyUSB to a separate directory ${ST_TINYUSB}

cd ${ST_TINYUSB}
git clone https://github.com/hathach/tinyusb.git
cd ${ST_TINYUSB}/tinyusb/tools
python3 get_deps.py  stm32f0

Build the midi_test example

cd ${ST_TINYUSB}/tinyusb/examples/device/midi_test
make BOARD=stm32f072disco all

Use stlink-org or official ST tools to program the STM32F072B-Discovery board with ${ST_TINYUSB}/tinyusb/examples/device/midi_test/_build/stm32f072disco/usb_test.hex

Use a Mini USB to USB A cable to plug the STM32F072-Discovery board to the Pico board's Micro USB to USB A adapter. You should observe the console display the information about the midi_test USB device and then an unending stream of MIDI packets that is coming from the midi_test software on the STM32F072-Discovery board.

You can insert the USB sniffer between the two boards to observe the USB traffic as a baseline.

Part Two Apply this patch to ${ST_TINYUSB}/tinyusb

diff --git a/src/portable/st/stm32_fsdev/dcd_stm32_fsdev.c b/src/portable/st/stm32_fsdev/dcd_stm32_fsdev.c
index 252969648..85c37bdf0 100644
--- a/src/portable/st/stm32_fsdev/dcd_stm32_fsdev.c
+++ b/src/portable/st/stm32_fsdev/dcd_stm32_fsdev.c
@@ -607,8 +607,9 @@ bool dcd_edpt_open(uint8_t rhport, tusb_desc_endpoint_t const *desc_ep) {
   xfer->ep_idx = ep_idx;

   ep_change_status(&ep_reg, dir, EP_STAT_NAK);
-  ep_change_dtog(&ep_reg, dir, 0);
-
+  //ep_change_dtog(&ep_reg, dir, 0);
+  //corrupt the DTOG bit for EP 1 IN endpoint
+  ep_change_dtog(&ep_reg, dir, ep_addr == 0x81 ? 1:0);
   // reserve other direction toggle bits
   if (dir == TUSB_DIR_IN) {
     ep_reg &= ~(USB_EPRX_STAT | USB_EP_DTOG_RX);

Delete the ${ST_TINYUSB}/tinyusb/examples/device/midi_test/_build/ directory and rebuild midi_test.hex

Reflash the STM32F072B-Discovery board and plug it, via a USB sniffer, to the Pico board running usb_midi_host_example. The USB sniffer trace should show the DATA1 infinite loop as shown above.

P33M commented 2 months ago

Unfortunately I don't have any STM32 parts. I do have a simulation model of a CDC UART, which I can corrupt in almost the same fashion.

P33M commented 2 months ago

With a modified model that transmits a short DATA1 packet in response to the first IN for the endpoint, it does look like the host controller does something bad and it looks a lot like that sequence.

However data sequence errors have been seen in the field - once for a host control transfer where the data stage toggle was wrong, and for device OUT bulk transfers. So what's different about host bulk IN?

rppicomidi commented 2 months ago

@P33M The data sequence errors on the control endpoint are using EPx, which is polled under software control. The Host Bulk In uses EP1 and EP2, which are automatically polled. EPx uses registers SIE_CTRL, SIE_STATUS. EP1-EP15 are enabled using INT_EP_CTRL; the control registers for these endpoints are in memory as described in the RP2040 datasheet section "4.1.2.5.2. Layout." I do not know how those differ internally. One hint is I do not believe the the SIE_STATUS register updates in host mode for any register other than EPx, which is used for the control endpoint exclusively in TinyUSB. So if the data sequence error recovery depends on the status of a bit in the SIE_STATUS, then it probably won't work.

I suspect that if I try a USB HID host example and make a USB HID device that has a data sequence error on Interrupt endpoint EP1, then I will see the same thing. Is that an experiment you can easily try there?

P33M commented 2 months ago

I believe the issue stems from the fact that the endpoint type is ignored and the endpoint index is used as a proxy as to whether or not to treat it as "interrupt". Can you try modifying your MIDI device for Interrupt IN?

rppicomidi commented 2 months ago

Behavior is the same for the Interrupt endpoint MIDI device and the Bulk endpoint MIDI device.

     3 : IN: 0x01/1
     6 : DATA1 (4): 09 90 51 7f 
    12 : IN: 0x01/1
    15 : DATA1 (4): 09 90 51 7f 
    22 : IN: 0x01/1
    24 : DATA1 (4): 09 90 51 7f 

See attached for the full trace.

MIDI_dev_int_epts.zip

rppicomidi commented 2 months ago

@P33M The data seems to show that using the auto-sequenced EP1-EP15 in host mode is not working properly for either Interrupt or Bulk endpoints in the face of data sequence errors. I am thinking about ways forward. About 3 years ago, I had a host controller driver that only used EPx for Bulk endpoints. It was buggy, and was crafted just for MIDI host, but I think I could do better now. On the other hand, I don't want to go through the effort of reviving that project if it will be a waste of time. Do you have simulation data that shows that an IN endpoint transaction that uses EPx instead of EP1-EP15 behaves correctly in the face of a data sequence error (i.e., just send ACK, drop the data, and set the DATA_SEQ_ERROR bit in the SIE_STATUS register)?

P33M commented 2 months ago

By inspection the path the EP1-EP15 transfers take is the same whether they are to bulk or interrupt endpoints. Isochronous should not be affected by this bug.

There are workarounds -

rppicomidi commented 2 months ago

@P33M Thank you for your workarounds! They do trigger more questions from me. Sorry.

For the first workaround: since Interrupt and Bulk endpoints behave the same way, wouldn't I need to route all Interrupt and Bulk traffic through EPx?

For the second and third workaround: Is this only ever a first packet issue, or would a random data sequence error caused by the device missing an ACK from the host also kick off the infinite IN with no ACK error?

For the last workaround: How does the host know that there is a data sequence error on an EPn endpoint? Isn't the SIE_STATUS register is tied to EPx and not the EPn endpoints? If I know the in advance that this issue is in an attached device (for example, by detecting the Device ID and Product ID of an Arturia BeatStep Pro), is this accomplished by writing 1 to EPn Buffer Control Register Bit 13?

Thank you again for your help with this.

P33M commented 2 months ago

Yes, the implication is that use of EPn is unsafe for interrupt or bulk if you're in a noisy environment where a Host ACK could get smashed. Unfortunately that means you need to maintain a software queue of interrupt endpoints, and enable the SOF interrupt to advance that queue.

The knowledge of "this endpoint belongs to a device with a wrong starting PID" has to be communicated from the interface driver when it requests the endpoint - but that means a global API change to work around a specific host bug.

rppicomidi commented 2 months ago

@P33M I have in my workspace implemented "Initialise EPn with a PID of 1, to accept the erroneous packet." and it does work. Thank you. Implementing the host controller driver that only uses EPx is a longer task.

rppicomidi commented 2 months ago

@P33M I started working on a new HCD that only uses the EPx and does not use the asynchronous endpoints. One obstacle to doing this is the automatic NAK retry controlled by the NAK_POLL interval. Because the HCD interrupt handler needs to allocate bandwidth on the bus now, the HCD has to handle the NAK interrupt. This is especially important for managing IN endpoints from HID devices with polling intervals longer than 1ms and for CDC devices, MIDI devices, etc. that have infrequent (relative to SOF interval) data availability and often just NAK.

Setting NAK_POLL to 0 does not disable it. Nor does setting it to 0x03FF03FF. There are several reserved bits in the SIE_CTRL register. Do any of those bits disable NAK_POLL and the automatic NAK retry feature (left in during development for testing purposes, for example)? If not, the HCD interrupt handler has to respond to a NAK interrupt by stopping the transfer (STOP_TRANS bit in SIE_CTRL). If so, please tell me what it is here. Also it would be good for future revisions of the datasheet to document this bit when it documents this chip erratum.

Thank you for your continued help on this issue.

P33M commented 2 months ago

RP2040 Host mode EPx Bulk polling is cumbersome, because as you observed it's not possible to cleanly stop polling the endpoint. Stopping the SIE also runs the risk of dropping a packet that was in-progress. For this reason RP2350 added a "stop EPx on NAK" feature with a variable polling count/timeout.

You can work around the infinite polling on RP2040 by treating Bulk transfers as Interrupt, so EPx will do exactly one handshake and then halt/interrupt the CPU. This has the drawback of requiring CPU intervention for every packet transferred (or NAK), but this approach allows for fair scheduling between endpoints.

rppicomidi commented 2 months ago

@P33M Is there a way for the HCD on the RP2040 to get an interrupt on NAK? I misread the datasheet my first go around. There seems to be a STALL_REC interrupt and a STALL_NAK interrupt (for devices) but no NAK_REC interrupt. The RP2350 has the EPX_STOPPED_ON_NAK interrupt. Would the HCD have to poll the SIE_STATUS register (e.g., in a frequently firing alarm interrupt or on the HOST_SOF interrupt) to find out if the NAK_REC bit is set?

P33M commented 2 months ago

On RP2040 if the Bulk endpoint is treated as Interrupt, EPx will stop after 1 handshake and you will get a TRANS_COMPLETE interrupt. NAK_REC in SIE_STATUS should be set alonside the interrupt flag in this case.

rppicomidi commented 2 months ago

Outstanding! Thank you.