Open HiFiPhile opened 3 years ago
This is already asked, but can you put up your board & setup and how you test the throughput and its result. This allow others to join and try with their own setup for comparison.
I've done some tests:
examples\device\cdc_dual_ports
, except using simplified cdc_task() from examples\device\cdc_msc
Config | Throughput | tud_task time (us) |
cdc_task time (us) |
Comment |
---|---|---|---|---|
Vanilla | 6.87 MB/s | 6.7-10 | 5.7 | |
512B Echo Buf | 15.54 MB/s | 6.8-9.8 | 7.5 | https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 |
2048B All Bufs | 25.67 MB/s | 12.8 | 13.6 | https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 and tusb_config.h |
512B cdc_task2 | 15.53 MB/s | 6.6-10 | 4.8 | Change buffer size in tusb_config.h and modify cdc_task as below |
2048B cdc_task2 | 28.26 MB/s | 12.8-16 | 4.9 | Change buffer size in tusb_config.h and modify cdc_task as below |
512B cdc_task2+ xfer_fifo | 18.30 MB/s | 6.8-8 | 4.8 | Change buffer size in tusb_config.h , edpt_xfer_fifo not implemented yet, just a simulation |
2048B cdc_task2+ xfer_fifo | 30.69 MB/s | 6.8-8 | 4.8 | Change buffer size in tusb_config.h , edpt_xfer_fifo not implemented yet, just a simulation |
4096B cdc_task2+ real xfer_fifo | 34.52 MB/s | 7.2-9 | 5.6 | Minimum 4k buffer is required for dcd_transdimension, changes from https://github.com/HiFiPhile/tinyusb/tree/xfer_fifo_transdimension |
With fifo exported from https://github.com/HiFiPhile/tinyusb/blob/60b41ffc1d35ceb8e7b25b9e6fb5202fe0778984/src/class/cdc/cdc_device.h#L108
void cdc_task2(void)
{
if ( tud_cdc_available() )
{
tu_fifo_t* fifo = tud_cdc_get_rx_ff();
tu_fifo_buffer_info_t info;
tu_fifo_get_read_info(fifo, &info);
// Echo back
tud_cdc_write(info.ptr_lin, info.len_lin);
if(info.len_wrap)
tud_cdc_write(info.ptr_wrap, info.len_wrap);
tu_fifo_advance_read_pointer(fifo, info.len_lin + info.len_wrap);
tud_cdc_n_rx_ff_read_done(0);
tud_cdc_write_flush();
}
}
Simulation of edpt_xfer_fifo, in this case there will be no more copy inside cdc_device
.
In https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/src/class/cdc/cdc_device.c#L438
Replace
tu_fifo_write_n(&p_cdc->rx_ff, &p_cdc->epout_buf, xferred_bytes);
with
tu_fifo_advance_write_pointer(&p_cdc->rx_ff,xferred_bytes);
According to my test do less copy can improve performance and reduce execution time significantly.
@mfp20 do you have any update on your project ?
@HiFiPhile which one?
@HiFiPhile which one?
Last month I saw you had some discussion about DMA and CDC, have you sorted it out ?
@HiFiPhile which one?
Last month I saw you had some discussion about DMA and CDC, have you sorted it out ?
No.
I wrote a "dma pump" to allocate DMA channels and use on request of heterogeneous consumers. But performance were not as good as integrating dma in peripheral's functions. Probably working a bit more on that route would bring good results, but it is too much work to circumvent the lack of control over buffers. The easiest path would be to be able to supply your own buffers to peripherals; in the case of TinyUSB use pointers, function ptrs and ptrs to ptrs, instead of allocating a tu_fifo by default, so that developer can then use his own buffer and supply his own buffer access routines to TinyUSB.
@HiFiPhile wow, that is a very great detail, current queuing only 1 transfer per endpoint can also limit the actual throughput. Though expanding it can increase code complexity since there is more mcu that does not support that than one does. But definitely what we could take a look a.
Though expanding it can increase code complexity since there is more mcu that does not support that than one does. But definitely what we could take a look.
Yes I agree with you, currently 30MB/s is close to the limit, I think it's easier to focus on reduce CPU utilisation.
I'll try to add edpt_xfer_fifo support for CDC class and have few questions @hathach , @PanRe :
For OUT xfer:
Is the caller's responsibility to check free space (DCD just do the queue) ?
For IN xfer:
Is the caller's responsibility to ensure there are at least total_bytes
elements in fifo ?
If there are more elements in fifo, should the length be limited to total_bytes
?
HI, As for the OUT transfer, i would say since CDC is a supposed to be a reliable connection (Bulk EP), i would set the FIFO to be non-overwriteable, check within the CDC driver if the received frame can be stored within the FIFO entirely and if not don't acknowledge the successful reception such that the host resends the frame again (hoping that later on there will be space available). Within UAC2, i selected the FIFO to be overwriteable and every frame gets acknowledged. This, however, corresponds to the nature of iso streams. So i would say the caller (i guess you mean the end user) is not responsible to check for free space. He just needs to empty the FIFO to get new data.
For IN transfers, i am currently not sure what total_bytes
refers to. I guess you mean the total packet size i.e. 8, 16, 32 or 64 bytes for full speed and 512 for high speed devices?
This, however, corresponds to the nature of iso streams. So i would say the caller (i guess you mean the end user)
Sorry it's not very clear, I mean between class driver and dcd tud_edpt_xfer_fifo
implementation. Seems like class drivers check if everything ok before call tud_edpt_xfer_fifo
but I'm not sure, I'm also trying to add tud_edpt_xfer_fifo
for dcd_transdimension.
For IN transfers, i am currently not sure what total_bytes
I mean how dcd's tud_edpt_xfer_fifo
implementation treat the argument of tud_edpt_xfer_fifo
. Now I think total_bytes
should be sent if there are more elements in the fifo, maybe in some futures classes there is a need to fill the fifo and send slowly.
Ah ok. Yes i would say it is the class drivers responsibility to check everything and the DCD only queues the ordered number of bytes given by total_bytes
. This is the same as the linear buffer functions dcd_edpt_xfer()
does. I always would restrict to total_bytes
and never send more... Since it is a FIFO, there is a chance that the number of bytes in the FIFO is bigger at the time when the transmit gets executed compared to the point in time where the class driver scheduled the transmit. This might end up in undefined behaviour. For instance, a bulk transfer can be defined to be finished in case a non-max-packet-size packet was transmitted. I know Windows drivers do not use this mode (they need an explicit zero length packet) but i am not sure about others.
If there are too less bytes in the FIFO compared to the scheduled total_bytes
i think there went something terribly wrong and i would consider this a hard fault. At least i can not think of any such situation. Currently, this case is not checked at all as it was not possible for the linear buffer functions dcd_edpt_xfer()
and i did not implement such a check for the FIFO case.
Maybe tu_fifo_remaining
should take account if the fifo is overwritable.
Currently there is a bug in CDC when DTR bit is not set. FIFO is set to overwritable in this case, but since tu_fifo_remaining
is checked before the transfer, no transfer is queued.
I've add cdc & dcd_transmission xfer_fifo support and updated the chart, result seems pretty good.
Hi @HiFiPhile, I'm relatively new to TinyUSB and just got it working on the STM32F429 Evaluation board running @180MHz with the external High Speed PHY. I'm using an adaptation of the examples\device\cdc_msc
configuration so it just enumerates 1 CDC port and using the simplified cdc_task() from examples\device\cdc_msc .
After using your python script for testing the throughput on my Windows 10 machine, I get very different results from your chart. Without compiler optimization for GCC, I get 1.90MB/s throughput on vanilla code and 2.23 MB/s with 2048 byte buffer sizes. When enabling optimization, it roughly doubles to 4.52 MB/s with the 2048 byte buffers.
Since the STM32F429 MCU is about 24MHz slower, I would assume the throughput would be slower but I find it odd that it is about 5 times slower than what I see in your charts. Am I missing something here?
I use a bare minimum configuration generated with CubeMX for the USB HS interface with DMA enabled but without enabling the Middleware from ST. I adjusted the usb_config.h file to look something like this:
// Board Specific Configuration
//--------------------------------------------------------------------+
// RHPort number used for device can be defined by board.mk, default to port 0
#ifndef BOARD_TUD_RHPORT
#define BOARD_TUD_RHPORT 1
#endif
// RHPort max operational speed can defined by board.mk
#ifndef BOARD_TUD_MAX_SPEED
#define BOARD_TUD_MAX_SPEED OPT_MODE_HIGH_SPEED
#endif
//--------------------------------------------------------------------
// Common Configuration
//--------------------------------------------------------------------
// defined by compiler flags for flexibility
#ifndef CFG_TUSB_MCU
#define CFG_TUSB_MCU OPT_MCU_STM32F4
#define CFG_TUSB_OS OPT_OS_NONE
#define BOARD_DEVICE_RHPORT_SPEED OPT_MODE_HIGH_SPEED // 480mbps
#define BOARD_DEVICE_RHPORT_NUM 1
#define CFG_TUSB_RHPORT1_MODE (OPT_MODE_DEVICE | OPT_MODE_HIGH_SPEED)
#endif
#ifndef CFG_TUSB_OS
#define CFG_TUSB_OS OPT_OS_NONE
#endif
#ifndef CFG_TUSB_DEBUG
#define CFG_TUSB_DEBUG 0
#endif
// Enable Device stack
#define CFG_TUD_ENABLED 1
// Default is max speed that hardware controller could support with on-chip PHY
#define CFG_TUD_MAX_SPEED BOARD_TUD_MAX_SPEED
/* USB DMA on some MCUs can only access a specific SRAM region with restriction on alignment.
* Tinyusb use follows macros to declare transferring memory so that they can be put
* into those specific section.
* e.g
* - CFG_TUSB_MEM SECTION : __attribute__ (( section(".usb_ram") ))
* - CFG_TUSB_MEM_ALIGN : __attribute__ ((aligned(4)))
*/
#ifndef CFG_TUSB_MEM_SECTION
#define CFG_TUSB_MEM_SECTION
#endif
#ifndef CFG_TUSB_MEM_ALIGN
#define CFG_TUSB_MEM_ALIGN __attribute__ ((aligned(4)))
#endif
//--------------------------------------------------------------------
// DEVICE CONFIGURATION
//--------------------------------------------------------------------
#ifndef CFG_TUD_ENDPOINT0_SIZE
#define CFG_TUD_ENDPOINT0_SIZE 64
#endif
//------------- CLASS -------------//
#define CFG_TUD_CDC 1
#define CFG_TUD_MSC 0
#define CFG_TUD_HID 0
#define CFG_TUD_MIDI 0
#define CFG_TUD_VENDOR 0
// CDC FIFO size of TX and RX
#define CFG_TUD_CDC_RX_BUFSIZE (TUD_OPT_HIGH_SPEED ? 2048 : 64)
#define CFG_TUD_CDC_TX_BUFSIZE (TUD_OPT_HIGH_SPEED ? 2048 : 64)
// CDC Endpoint transfer buffer size, more is faster
#define CFG_TUD_CDC_EP_BUFSIZE (TUD_OPT_HIGH_SPEED ? 2048 : 64)```
Without compiler optimization for GCC, I get 1.90MB/s throughput on vanilla code and 2.23 MB/s with 2048 byte buffer sizes. When enabling optimization, it roughly doubles to 4.52 MB/s with the 2048 byte buffers.
Indeed these numbers are not good for HS transfer...
I thank it's from both hardware & software.
The Synopsis DWC2 IP inside STM32F4 is less capable than Chipidea HS inside LPC43XX. Chipidea one has embedded DMA who can transfer multiple packets without CPU intervention or buffer copy, while for DWC2 there are much more work needed to do. You can see DWC2 driver is more than 2x larger.
Software side dcd_dwc2
driver hasn't DMA support yet, so even if you enabled it in Cube packets are still read / write by CPU.
https://github.com/hathach/tinyusb/blob/aaff27d220dd95cc018bf01f54df5dae706a52ae/src/portable/synopsys/dwc2/dcd_dwc2.c#L955
Same test (reduced a little) on LPC55S69:
examples\device\cdc_dual_ports
, except using simplified cdc_task() from examples\device\cdc_msc
Config | Throughput | Comment |
---|---|---|
Vanilla | 3.63 MB/s | |
512B Echo Buf | 10.85 MB/s | https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 |
2048B All Bufs | 20.58 MB/s | https://github.com/hathach/tinyusb/blob/5a4fc1151a7e62a291a90836165942737a1a2d6f/examples/device/cdc_msc/src/main.c#L115 and tusb_config.h |
512B cdc_task2 | 10.97 MB/s | Change buffer size in tusb_config.h and modify cdc_task as below |
2048B cdc_task2 | 20.58 MB/s | Change buffer size in tusb_config.h and modify cdc_task as below |
512B cdc_task2+ xfer_fifo | - | No fifo transfer support in DCD |
2048B cdc_task2+ xfer_fifo | - | No fifo transfer support in DCD |
4096B cdc_task2+ real xfer_fifo | - | No fifo transfer support in DCD |
Honestly I'm surprised the throughput is much lower than LPC4357 !
Both DCD have built-in DMA, 150MHz Cortex-M33 should perform not much less than 200MHz Cortex-M4.
For dcd_lpc_ip3511
DCD need to use a dedicated RAM region for packet buffer, maybe this RAM region has slow access speed...
I believe enabling DMA in the dcd_dwc2 driver would be beneficial for upping the transfer rates. Trying a throughput test with the standard ST middleware with DMA enabled I can get approximately 16MB/s throughput, as long as the data size is limited.
I'm afraid it will take me a considerable amount of time to get that working. Given the throughput rates you are reporting, I am considering using an NXP MCU in the future though ^^
Is your feature request related to a problem? Please describe.
Raw throughput of USB can be high but once you make some data copy the speed will drop significantly. Currently there is no way to skip fifo & buffer in CDC class:
USB HW Buffer ==> ep_out_buf ==> rx_ff ==> User app
Besides DMA is not supported.We had some discussions on this subject, for example expose the fifo or add xfer_fifo support.
Describe the solution you'd like Sort out a solution to make CDC class capable of high-speed streaming