hathach / tinyusb

An open source cross-platform USB stack for embedded system
https://www.tinyusb.org
MIT License
4.98k stars 1.05k forks source link

High-speed streaming capacity of CDC class #920

Open HiFiPhile opened 3 years ago

HiFiPhile commented 3 years ago

Is your feature request related to a problem? Please describe.

Raw throughput of USB can be high but once you make some data copy the speed will drop significantly. Currently there is no way to skip fifo & buffer in CDC class: USB HW Buffer ==> ep_out_buf ==> rx_ff ==> User app Besides DMA is not supported.

We had some discussions on this subject, for example expose the fifo or add xfer_fifo support.

Describe the solution you'd like Sort out a solution to make CDC class capable of high-speed streaming

hathach commented 3 years ago

This is already asked, but can you put up your board & setup and how you test the throughput and its result. This allow others to join and try with their own setup for comparison.

HiFiPhile commented 3 years ago

I've done some tests:

Config Throughput tud_task time (us) cdc_task time (us) Comment
Vanilla 6.87 MB/s 6.7-10 5.7
512B Echo Buf 15.54 MB/s 6.8-9.8 7.5 https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115
2048B All Bufs 25.67 MB/s 12.8 13.6 https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 and tusb_config.h
512B cdc_task2 15.53 MB/s 6.6-10 4.8 Change buffer size in tusb_config.h and modify cdc_task as below
2048B cdc_task2 28.26 MB/s 12.8-16 4.9 Change buffer size in tusb_config.h and modify cdc_task as below
512B cdc_task2+ xfer_fifo 18.30 MB/s 6.8-8 4.8 Change buffer size in tusb_config.h, edpt_xfer_fifo not implemented yet, just a simulation
2048B cdc_task2+ xfer_fifo 30.69 MB/s 6.8-8 4.8 Change buffer size in tusb_config.h, edpt_xfer_fifo not implemented yet, just a simulation
4096B cdc_task2+ real xfer_fifo 34.52 MB/s 7.2-9 5.6 Minimum 4k buffer is required for dcd_transdimension, changes from https://github.com/HiFiPhile/tinyusb/tree/xfer_fifo_transdimension

cdc_task2:

With fifo exported from https://github.com/HiFiPhile/tinyusb/blob/60b41ffc1d35ceb8e7b25b9e6fb5202fe0778984/src/class/cdc/cdc_device.h#L108

void cdc_task2(void)
{
  if ( tud_cdc_available() )
  {
    tu_fifo_t* fifo = tud_cdc_get_rx_ff();
    tu_fifo_buffer_info_t info;
    tu_fifo_get_read_info(fifo, &info);
    // Echo back
    tud_cdc_write(info.ptr_lin, info.len_lin);
    if(info.len_wrap)
      tud_cdc_write(info.ptr_wrap, info.len_wrap);
    tu_fifo_advance_read_pointer(fifo, info.len_lin + info.len_wrap);
    tud_cdc_n_rx_ff_read_done(0);
    tud_cdc_write_flush();
  }
}

xfer_fifo:

Simulation of edpt_xfer_fifo, in this case there will be no more copy inside cdc_device. In https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/src/class/cdc/cdc_device.c#L438 Replace tu_fifo_write_n(&p_cdc->rx_ff, &p_cdc->epout_buf, xferred_bytes); with tu_fifo_advance_write_pointer(&p_cdc->rx_ff,xferred_bytes);


According to my test do less copy can improve performance and reduce execution time significantly.

@mfp20 do you have any update on your project ?

mfp20 commented 3 years ago

@HiFiPhile which one?

HiFiPhile commented 3 years ago

@HiFiPhile which one?

Last month I saw you had some discussion about DMA and CDC, have you sorted it out ?

mfp20 commented 3 years ago

@HiFiPhile which one?

Last month I saw you had some discussion about DMA and CDC, have you sorted it out ?

No.

I wrote a "dma pump" to allocate DMA channels and use on request of heterogeneous consumers. But performance were not as good as integrating dma in peripheral's functions. Probably working a bit more on that route would bring good results, but it is too much work to circumvent the lack of control over buffers. The easiest path would be to be able to supply your own buffers to peripherals; in the case of TinyUSB use pointers, function ptrs and ptrs to ptrs, instead of allocating a tu_fifo by default, so that developer can then use his own buffer and supply his own buffer access routines to TinyUSB.

hathach commented 3 years ago

@HiFiPhile wow, that is a very great detail, current queuing only 1 transfer per endpoint can also limit the actual throughput. Though expanding it can increase code complexity since there is more mcu that does not support that than one does. But definitely what we could take a look a.

HiFiPhile commented 3 years ago

Though expanding it can increase code complexity since there is more mcu that does not support that than one does. But definitely what we could take a look.

Yes I agree with you, currently 30MB/s is close to the limit, I think it's easier to focus on reduce CPU utilisation.

I'll try to add edpt_xfer_fifo support for CDC class and have few questions @hathach , @PanRe : For OUT xfer: Is the caller's responsibility to check free space (DCD just do the queue) ? For IN xfer: Is the caller's responsibility to ensure there are at least total_bytes elements in fifo ? If there are more elements in fifo, should the length be limited to total_bytes ?

PanRe commented 3 years ago

HI, As for the OUT transfer, i would say since CDC is a supposed to be a reliable connection (Bulk EP), i would set the FIFO to be non-overwriteable, check within the CDC driver if the received frame can be stored within the FIFO entirely and if not don't acknowledge the successful reception such that the host resends the frame again (hoping that later on there will be space available). Within UAC2, i selected the FIFO to be overwriteable and every frame gets acknowledged. This, however, corresponds to the nature of iso streams. So i would say the caller (i guess you mean the end user) is not responsible to check for free space. He just needs to empty the FIFO to get new data.

For IN transfers, i am currently not sure what total_bytes refers to. I guess you mean the total packet size i.e. 8, 16, 32 or 64 bytes for full speed and 512 for high speed devices?

HiFiPhile commented 3 years ago

This, however, corresponds to the nature of iso streams. So i would say the caller (i guess you mean the end user)

Sorry it's not very clear, I mean between class driver and dcd tud_edpt_xfer_fifo implementation. Seems like class drivers check if everything ok before call tud_edpt_xfer_fifo but I'm not sure, I'm also trying to add tud_edpt_xfer_fifo for dcd_transdimension.

For IN transfers, i am currently not sure what total_bytes

I mean how dcd's tud_edpt_xfer_fifo implementation treat the argument of tud_edpt_xfer_fifo. Now I think total_bytes should be sent if there are more elements in the fifo, maybe in some futures classes there is a need to fill the fifo and send slowly.

PanRe commented 3 years ago

Ah ok. Yes i would say it is the class drivers responsibility to check everything and the DCD only queues the ordered number of bytes given by total_bytes. This is the same as the linear buffer functions dcd_edpt_xfer() does. I always would restrict to total_bytes and never send more... Since it is a FIFO, there is a chance that the number of bytes in the FIFO is bigger at the time when the transmit gets executed compared to the point in time where the class driver scheduled the transmit. This might end up in undefined behaviour. For instance, a bulk transfer can be defined to be finished in case a non-max-packet-size packet was transmitted. I know Windows drivers do not use this mode (they need an explicit zero length packet) but i am not sure about others. If there are too less bytes in the FIFO compared to the scheduled total_bytes i think there went something terribly wrong and i would consider this a hard fault. At least i can not think of any such situation. Currently, this case is not checked at all as it was not possible for the linear buffer functions dcd_edpt_xfer() and i did not implement such a check for the FIFO case.

HiFiPhile commented 3 years ago

Maybe tu_fifo_remaining should take account if the fifo is overwritable. Currently there is a bug in CDC when DTR bit is not set. FIFO is set to overwritable in this case, but since tu_fifo_remaining is checked before the transfer, no transfer is queued.

HiFiPhile commented 3 years ago

I've add cdc & dcd_transmission xfer_fifo support and updated the chart, result seems pretty good.

mijnendatum commented 1 year ago

Hi @HiFiPhile, I'm relatively new to TinyUSB and just got it working on the STM32F429 Evaluation board running @180MHz with the external High Speed PHY. I'm using an adaptation of the examples\device\cdc_msc configuration so it just enumerates 1 CDC port and using the simplified cdc_task() from examples\device\cdc_msc .

After using your python script for testing the throughput on my Windows 10 machine, I get very different results from your chart. Without compiler optimization for GCC, I get 1.90MB/s throughput on vanilla code and 2.23 MB/s with 2048 byte buffer sizes. When enabling optimization, it roughly doubles to 4.52 MB/s with the 2048 byte buffers.

Since the STM32F429 MCU is about 24MHz slower, I would assume the throughput would be slower but I find it odd that it is about 5 times slower than what I see in your charts. Am I missing something here?

I use a bare minimum configuration generated with CubeMX for the USB HS interface with DMA enabled but without enabling the Middleware from ST. I adjusted the usb_config.h file to look something like this:


// Board Specific Configuration
//--------------------------------------------------------------------+

// RHPort number used for device can be defined by board.mk, default to port 0
#ifndef BOARD_TUD_RHPORT
#define BOARD_TUD_RHPORT      1
#endif

// RHPort max operational speed can defined by board.mk
#ifndef BOARD_TUD_MAX_SPEED
#define BOARD_TUD_MAX_SPEED   OPT_MODE_HIGH_SPEED
#endif

//--------------------------------------------------------------------
// Common Configuration
//--------------------------------------------------------------------

// defined by compiler flags for flexibility
#ifndef CFG_TUSB_MCU
#define CFG_TUSB_MCU                 OPT_MCU_STM32F4
#define CFG_TUSB_OS                  OPT_OS_NONE
#define BOARD_DEVICE_RHPORT_SPEED    OPT_MODE_HIGH_SPEED  // 480mbps
#define BOARD_DEVICE_RHPORT_NUM     1
#define CFG_TUSB_RHPORT1_MODE       (OPT_MODE_DEVICE | OPT_MODE_HIGH_SPEED)
#endif

#ifndef CFG_TUSB_OS
#define CFG_TUSB_OS           OPT_OS_NONE
#endif

#ifndef CFG_TUSB_DEBUG
#define CFG_TUSB_DEBUG        0
#endif

// Enable Device stack
#define CFG_TUD_ENABLED       1

// Default is max speed that hardware controller could support with on-chip PHY
#define CFG_TUD_MAX_SPEED     BOARD_TUD_MAX_SPEED

/* USB DMA on some MCUs can only access a specific SRAM region with restriction on alignment.
 * Tinyusb use follows macros to declare transferring memory so that they can be put
 * into those specific section.
 * e.g
 * - CFG_TUSB_MEM SECTION : __attribute__ (( section(".usb_ram") ))
 * - CFG_TUSB_MEM_ALIGN   : __attribute__ ((aligned(4)))
 */
#ifndef CFG_TUSB_MEM_SECTION
#define CFG_TUSB_MEM_SECTION
#endif

#ifndef CFG_TUSB_MEM_ALIGN
#define CFG_TUSB_MEM_ALIGN    __attribute__ ((aligned(4)))
#endif

//--------------------------------------------------------------------
// DEVICE CONFIGURATION
//--------------------------------------------------------------------

#ifndef CFG_TUD_ENDPOINT0_SIZE
#define CFG_TUD_ENDPOINT0_SIZE    64
#endif

//------------- CLASS -------------//
#define CFG_TUD_CDC              1
#define CFG_TUD_MSC              0
#define CFG_TUD_HID              0
#define CFG_TUD_MIDI             0
#define CFG_TUD_VENDOR           0

 // CDC FIFO size of TX and RX
 #define CFG_TUD_CDC_RX_BUFSIZE   (TUD_OPT_HIGH_SPEED ? 2048 : 64)
 #define CFG_TUD_CDC_TX_BUFSIZE   (TUD_OPT_HIGH_SPEED ? 2048 : 64)

 // CDC Endpoint transfer buffer size, more is faster
 #define CFG_TUD_CDC_EP_BUFSIZE   (TUD_OPT_HIGH_SPEED ? 2048 : 64)```
HiFiPhile commented 1 year ago

Without compiler optimization for GCC, I get 1.90MB/s throughput on vanilla code and 2.23 MB/s with 2048 byte buffer sizes. When enabling optimization, it roughly doubles to 4.52 MB/s with the 2048 byte buffers.

Indeed these numbers are not good for HS transfer...

I thank it's from both hardware & software.

The Synopsis DWC2 IP inside STM32F4 is less capable than Chipidea HS inside LPC43XX. Chipidea one has embedded DMA who can transfer multiple packets without CPU intervention or buffer copy, while for DWC2 there are much more work needed to do. You can see DWC2 driver is more than 2x larger.

Software side dcd_dwc2 driver hasn't DMA support yet, so even if you enabled it in Cube packets are still read / write by CPU. https://github.com/hathach/tinyusb/blob/aaff27d220dd95cc018bf01f54df5dae706a52ae/src/portable/synopsys/dwc2/dcd_dwc2.c#L955

HiFiPhile commented 1 year ago

Same test (reduced a little) on LPC55S69:

Config Throughput Comment
Vanilla 3.63 MB/s
512B Echo Buf 10.85 MB/s https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115
2048B All Bufs 20.58 MB/s https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 and tusb_config.h
512B cdc_task2 10.97 MB/s Change buffer size in tusb_config.h and modify cdc_task as below
2048B cdc_task2 20.58 MB/s Change buffer size in tusb_config.h and modify cdc_task as below
512B cdc_task2+ xfer_fifo - No fifo transfer support in DCD
2048B cdc_task2+ xfer_fifo - No fifo transfer support in DCD
4096B cdc_task2+ real xfer_fifo - No fifo transfer support in DCD

Honestly I'm surprised the throughput is much lower than LPC4357 !

Both DCD have built-in DMA, 150MHz Cortex-M33 should perform not much less than 200MHz Cortex-M4.

For dcd_lpc_ip3511 DCD need to use a dedicated RAM region for packet buffer, maybe this RAM region has slow access speed...

mijnendatum commented 1 year ago

I believe enabling DMA in the dcd_dwc2 driver would be beneficial for upping the transfer rates. Trying a throughput test with the standard ST middleware with DMA enabled I can get approximately 16MB/s throughput, as long as the data size is limited.

I'm afraid it will take me a considerable amount of time to get that working. Given the throughput rates you are reporting, I am considering using an NXP MCU in the future though ^^