Open kolabit opened 7 months ago
Hi @kolabit ,
We typically do benchmarks with standard tools, so we will provide some suggestions and benchmarks based on iperf. The issue you are describing seems to be an application/CPU-bound issue. You'll need to leverage multiple cores to maximise the throughput.
You'll need to use a benchmark tool that supports multi-threading. In this case you could use iperf2, or iperf3 v3.16 and onwards. It is worth mentioning that if you intend to use iperf3, you'll need to update Yocto or Buildroot to use a version of iperf that supports multithreading. Please find attached a patch you could apply to Yocto to update iperf3 to version 3.16.
Please find below the instructions on how to run iperf to do UDP benchmarks:
On the server side:
$ iperf3 -s -p 5002
On the client side (Icicle Kit):
$ iperf3 -c <server_ip> -P 4 -u -p 5002 -b 1G
You should see something like this:
-----------------------------------------------------------
Server listening on 5002
-----------------------------------------------------------
Accepted connection from 10.205.160.81, port 48964
[ 5] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 54212
[ 6] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 34240
[ 9] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 47235
[ 11] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 49467
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-1.00 sec 34.0 MBytes 285 Mbits/sec 0.016 ms 0/24612 (0%)
[ 6] 0.00-1.00 sec 4.49 MBytes 37.6 Mbits/sec 0.083 ms 0/3250 (0%)
[ 9] 0.00-1.00 sec 34.9 MBytes 293 Mbits/sec 0.013 ms 0/25281 (0%)
[ 11] 0.00-1.00 sec 30.2 MBytes 253 Mbits/sec 0.019 ms 0/21837 (0%)
[SUM] 0.00-1.00 sec 104 MBytes 869 Mbits/sec 0.033 ms 0/74980 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 1.00-2.00 sec 35.5 MBytes 297 Mbits/sec 0.014 ms 0/25676 (0%)
[ 6] 1.00-2.00 sec 4.61 MBytes 38.7 Mbits/sec 0.057 ms 0/3337 (0%)
[ 9] 1.00-2.00 sec 36.5 MBytes 306 Mbits/sec 0.013 ms 0/26426 (0%)
[ 11] 1.00-2.00 sec 31.5 MBytes 264 Mbits/sec 0.020 ms 0/22821 (0%)
[SUM] 1.00-2.00 sec 108 MBytes 907 Mbits/sec 0.026 ms 0/78260 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 2.00-3.00 sec 35.5 MBytes 298 Mbits/sec 0.015 ms 0/25697 (0%)
[ 6] 2.00-3.00 sec 4.65 MBytes 39.0 Mbits/sec 0.092 ms 0/3366 (0%)
[ 9] 2.00-3.00 sec 36.4 MBytes 305 Mbits/sec 0.015 ms 0/26346 (0%)
[ 11] 2.00-3.00 sec 31.5 MBytes 264 Mbits/sec 0.017 ms 0/22814 (0%)
[SUM] 2.00-3.00 sec 108 MBytes 906 Mbits/sec 0.035 ms 0/78223 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 3.00-4.00 sec 35.4 MBytes 297 Mbits/sec 0.017 ms 0/25650 (0%)
[ 6] 3.00-4.00 sec 4.73 MBytes 39.7 Mbits/sec 0.079 ms 0/3428 (0%)
[ 9] 3.00-4.00 sec 35.9 MBytes 301 Mbits/sec 0.015 ms 0/25962 (0%)
[ 11] 3.00-4.00 sec 31.5 MBytes 264 Mbits/sec 0.018 ms 0/22782 (0%)
[SUM] 3.00-4.00 sec 107 MBytes 902 Mbits/sec 0.032 ms 0/77822 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 4.00-5.00 sec 35.4 MBytes 297 Mbits/sec 0.016 ms 0/25604 (0%)
[ 6] 4.00-5.00 sec 4.70 MBytes 39.4 Mbits/sec 0.085 ms 0/3400 (0%)
[ 9] 4.00-5.00 sec 36.0 MBytes 302 Mbits/sec 0.015 ms 0/26036 (0%)
[ 11] 4.00-5.00 sec 31.4 MBytes 263 Mbits/sec 0.019 ms 0/22735 (0%)
[SUM] 4.00-5.00 sec 107 MBytes 901 Mbits/sec 0.034 ms 0/77775 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 5.00-6.00 sec 35.3 MBytes 296 Mbits/sec 0.014 ms 0/25574 (0%)
[ 6] 5.00-6.00 sec 4.79 MBytes 40.2 Mbits/sec 0.064 ms 0/3467 (0%)
[ 9] 5.00-6.00 sec 35.8 MBytes 300 Mbits/sec 0.012 ms 0/25905 (0%)
[ 11] 5.00-6.00 sec 31.3 MBytes 263 Mbits/sec 0.022 ms 0/22698 (0%)
[SUM] 5.00-6.00 sec 107 MBytes 899 Mbits/sec 0.028 ms 0/77644 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 6.00-7.00 sec 35.2 MBytes 295 Mbits/sec 0.016 ms 0/25466 (0%)
[ 6] 6.00-7.00 sec 4.95 MBytes 41.5 Mbits/sec 0.048 ms 0/3584 (0%)
[ 9] 6.00-7.00 sec 35.8 MBytes 300 Mbits/sec 0.015 ms 0/25890 (0%)
[ 11] 6.00-7.00 sec 31.0 MBytes 260 Mbits/sec 0.019 ms 0/22477 (0%)
[SUM] 6.00-7.00 sec 107 MBytes 897 Mbits/sec 0.024 ms 0/77417 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 7.00-8.00 sec 35.2 MBytes 295 Mbits/sec 0.014 ms 0/25489 (0%)
[ 6] 7.00-8.00 sec 4.90 MBytes 41.1 Mbits/sec 0.077 ms 0/3548 (0%)
[ 9] 7.00-8.00 sec 35.9 MBytes 301 Mbits/sec 0.015 ms 0/25962 (0%)
[ 11] 7.00-8.00 sec 31.1 MBytes 261 Mbits/sec 0.021 ms 0/22543 (0%)
[SUM] 7.00-8.00 sec 107 MBytes 898 Mbits/sec 0.032 ms 0/77542 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 8.00-9.00 sec 35.2 MBytes 296 Mbits/sec 0.012 ms 0/25513 (0%)
[ 6] 8.00-9.00 sec 4.89 MBytes 41.0 Mbits/sec 0.063 ms 0/3541 (0%)
[ 9] 8.00-9.00 sec 36.0 MBytes 302 Mbits/sec 0.018 ms 0/26098 (0%)
[ 11] 8.00-9.00 sec 31.1 MBytes 261 Mbits/sec 0.018 ms 0/22529 (0%)
[SUM] 8.00-9.00 sec 107 MBytes 900 Mbits/sec 0.028 ms 0/77681 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 9.00-10.00 sec 35.2 MBytes 296 Mbits/sec 0.015 ms 0/25510 (0%)
[ 6] 9.00-10.00 sec 4.90 MBytes 41.1 Mbits/sec 0.049 ms 0/3551 (0%)
[ 9] 9.00-10.00 sec 36.1 MBytes 303 Mbits/sec 0.015 ms 0/26139 (0%)
[ 11] 9.00-10.00 sec 31.1 MBytes 261 Mbits/sec 0.014 ms 0/22548 (0%)
[SUM] 9.00-10.00 sec 107 MBytes 901 Mbits/sec 0.023 ms 0/77748 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ 5] 10.00-10.04 sec 1.42 MBytes 283 Mbits/sec 0.018 ms 0/1029 (0%)
[ 6] 10.00-10.04 sec 208 KBytes 40.4 Mbits/sec 0.064 ms 0/147 (0%)
[ 9] 10.00-10.04 sec 1.46 MBytes 291 Mbits/sec 0.014 ms 0/1058 (0%)
[ 11] 10.00-10.04 sec 1.26 MBytes 251 Mbits/sec 0.015 ms 0/912 (0%)
[SUM] 10.00-10.04 sec 4.34 MBytes 865 Mbits/sec 0.028 ms 0/3146 (0%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-10.04 sec 353 MBytes 295 Mbits/sec 0.018 ms 0/255820 (0%) receiver
[ 6] 0.00-10.04 sec 47.8 MBytes 39.9 Mbits/sec 0.064 ms 0/34619 (0%) receiver
[ 9] 0.00-10.04 sec 361 MBytes 301 Mbits/sec 0.014 ms 0/261103 (0%) receiver
[ 11] 0.00-10.04 sec 313 MBytes 262 Mbits/sec 0.015 ms 0/226696 (0%) receiver
[SUM] 0.00-10.04 sec 1.05 GBytes 898 Mbits/sec 0.028 ms 0/778238 (0%) receiver
As shown in the last line, the average bitrate was 898 Mbits/sec when using 4 parallel streams, which is very close to the maximum theoretical throughput.
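For context on the theoretical maximum, the sketch below estimates the UDP goodput ceiling on a 1 Gbit/s Ethernet link. It assumes IPv4 without options, no VLAN tag, and a 1448-byte datagram payload (roughly what the log implies, e.g. 34.0 MBytes over 24612 datagrams in the first interval). The function and constant names are illustrative only.

```python
# Per-frame overhead on Ethernet, in bytes (standard values, no VLAN tag).
UDP_HEADER = 8
IPV4_HEADER = 20       # assumes no IP options
ETH_HEADER = 14
ETH_FCS = 4
PREAMBLE_SFD = 8
INTER_FRAME_GAP = 12


def udp_goodput_mbps(payload_bytes, line_rate_mbps=1000.0):
    """Payload bitrate achievable when the link is saturated with
    back-to-back frames, each carrying one unfragmented UDP datagram."""
    on_wire = (payload_bytes + UDP_HEADER + IPV4_HEADER + ETH_HEADER
               + ETH_FCS + PREAMBLE_SFD + INTER_FRAME_GAP)
    return line_rate_mbps * payload_bytes / on_wire


# Assumption: 1448-byte datagrams, as inferred from the iperf3 log.
ceiling = udp_goodput_mbps(1448)
print(f"ceiling: {ceiling:.1f} Mbits/sec")
print(f"measured 898 Mbits/sec is {898 / ceiling:.1%} of that ceiling")
```

Under these assumptions the ceiling is about 956 Mbits/sec, so the measured 898 Mbits/sec is roughly 94% of it.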
Hi @vfalanis, I don't think this approach is acceptable. In this case, we will have 4 CPUs fully busy sending a 1G UDP stream (!!!), which will deprive other tasks of CPU time. On the other hand, we will also need to implement sending the data from multiple CPUs.
Without real interrupt moderation in UDP, this platform is unusable.
Hi @kolabit ,
This is probably less a question of interrupt moderation and more to do with being CPU bound and differences in the UDP and TCP paths through the kernel networking stack.
The quad U54 cores on PolarFire SoC each offer a max clock speed of 625MHz and 1.7 DMIPS/MHz.
By comparison, a dual core ARM A9 offers 2.5 DMIPS/MHz and runs at up to 1GHz.
Therefore two U54s are roughly equivalent in CPU performance to one A9 core at 1GHz.
Despite having a lower clock speed (625MHz), the RISC-V CPU in PolarFire SoC offers reasonable performance with 1.7 DMIPS/MHz and 2.75 CoreMark/MHz ratings.
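The comparison above can be checked with a quick back-of-the-envelope calculation. This is a sketch that uses only the clock speeds quoted in this thread, 1.7 DMIPS/MHz for the U54, and the commonly quoted 2.5 DMIPS/MHz for the Cortex-A9. DMIPS is a coarse synthetic metric and says little about network stack behaviour, and the helper name is made up for illustration.

```python
def peak_dmips(clock_mhz, dmips_per_mhz, cores=1):
    """Aggregate Dhrystone MIPS, assuming perfect scaling across cores."""
    return clock_mhz * dmips_per_mhz * cores


one_u54 = peak_dmips(625, 1.7)            # single U54 at 625 MHz
two_u54 = peak_dmips(625, 1.7, cores=2)   # two U54s
one_a9 = peak_dmips(1000, 2.5)            # single Cortex-A9 core at 1 GHz

print(f"one U54: {one_u54:.1f} DMIPS")
print(f"two U54: {two_u54:.1f} DMIPS")
print(f"one A9 : {one_a9:.1f} DMIPS")
print(f"two U54s / one A9 core: {two_u54 / one_a9:.0%}")
```

On these figures two U54s come to about 2125 DMIPS against about 2500 DMIPS for one A9 core at 1 GHz, so the "roughly equivalent" comparison holds per A9 core rather than for a dual-core A9 as a whole.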
Utilizing two or three U54s enables your embedded system to distribute computational tasks more effectively, potentially improving overall throughput and responsiveness. In this way, your system could exploit parallelism and optimize resource usage more efficiently, ultimately maximizing the overall performance of the system.
Hi,
As I mentioned in the issue https://github.com/polarfire-soc/meta-polarfire-soc-yocto-bsp/issues/52 , UDP transfer speed is 300...400Mb/s. At the same time, TCP speed is close to 1Gb/s. I have tried Ubuntu and Yocto, and got the same results.
My first test used a simple BSD UDP socket that sends a 3K buffer from the Icicle to a dynamic port on my test PC. I got 300-400Mb/s. I tried both a udmabuf-ddr memory buffer and a regular buffer.
The second test used a kernel UDP socket. You can find the source here: https://github.com/kolabit/kernel_udp_test The speed was about the same: 300-400Mb/s.
For the best results, I have maxed the RX/TX buffers and the socket buffer size:
no significant changes.
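For readers who want to reproduce the user-space test described above, here is a minimal Python sketch of a blocking UDP sender that requests a larger send buffer and pushes fixed-size datagrams. It is not the code used in this issue (that was a BSD-socket test and the kernel module linked above). The function names and the 4 MiB buffer request are placeholders.

```python
import socket
import time


def make_udp_sender(sndbuf_bytes=4 * 1024 * 1024):
    """Create a UDP socket and request a larger send buffer.

    On Linux the request is capped by net.core.wmem_max, and the value
    read back is double the granted size (kernel bookkeeping overhead).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return sock, granted


def send_burst(sock, dest, payload_size=3072, count=1000):
    """Send `count` datagrams of `payload_size` bytes to `dest`.

    Returns (bytes_sent, payload_mbits_per_sec) as seen by the sender;
    datagrams dropped later in the path are not detected here.
    """
    payload = bytes(payload_size)  # zero-filled buffer (3K by default)
    start = time.perf_counter()
    sent = 0
    for _ in range(count):
        sent += sock.sendto(payload, dest)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return sent, sent * 8 / elapsed / 1e6
```

A run would look like `sock, granted = make_udp_sender()` followed by `send_burst(sock, (server_ip, 5002))`, where `server_ip` is the test PC's address. Python adds per-call overhead, so the figure it reports is likely to be lower than what an equivalent C sender achieves on the same board.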
Also, as far as I can see, the CPU gets one interrupt per UDP packet. I have tried to enable interrupt moderation and DMA coalescing, but they are not supported:
I checked the macb_main.c driver source, and it looks like these operations are not supported: it ALWAYS sends UDP data with one packet per IRQ.
Is there any chance of getting 1G Tx with UDP on PolarFire SoC?
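To get a feel for the interrupt load being discussed, the following sketch estimates how many Ethernet frames per second a UDP stream generates. It assumes IPv4 with a 1500-byte MTU and, following the observation above, one interrupt per transmitted frame. The helper names are illustrative.

```python
import math

UDP_HEADER = 8
IPV4_HEADER = 20  # assumes no IP options


def frames_per_datagram(udp_payload, mtu=1500):
    """Number of IP packets (Ethernet frames) one UDP datagram needs."""
    ip_payload = udp_payload + UDP_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


def frames_per_second(payload_bps, udp_payload, mtu=1500):
    """Frames per second needed to sustain a given UDP payload bitrate."""
    datagrams_per_sec = payload_bps / (udp_payload * 8)
    return datagrams_per_sec * frames_per_datagram(udp_payload, mtu)


# iperf3 run from this thread: ~898 Mbits/sec, ~1448-byte datagrams.
print(f"{frames_per_second(898e6, 1448):,.0f} frames/s")
# 3K buffers at 350 Mbits/sec: each datagram is split into fragments.
print(frames_per_datagram(3072), "frames per 3072-byte datagram")
print(f"{frames_per_second(350e6, 3072):,.0f} frames/s")
```

For the iperf3 run earlier in the thread this gives roughly 77,500 frames per second, which matches the datagram rate in the log. A 3072-byte buffer does not fit in a 1500-byte MTU and is split into three IP fragments, so at 350 Mbits/sec it would produce roughly 42,700 frames per second under these assumptions.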