polarfire-soc / meta-polarfire-soc-yocto-bsp

PolarFire SoC yocto Board Support Package

UDP speed is very low. #54

Open kolabit opened 7 months ago

kolabit commented 7 months ago

Hi. As I mentioned in issue https://github.com/polarfire-soc/meta-polarfire-soc-yocto-bsp/issues/52, UDP transfer speed is 300-400 Mb/s, while TCP speed is close to 1 Gb/s. I have tried Ubuntu and Yocto and got the same results.

My first test used a simple BSD UDP socket that sends a 3K buffer from the Icicle to a dynamic port on my test PC (a minimal sketch of this kind of sender is shown after the tuning commands below). I got 300-400 Mb/s, with both a udmabuf DDR memory buffer and a regular buffer. The second test used a kernel UDP socket; you can find the source here: https://github.com/kolabit/kernel_udp_test. The speed was about the same, 300-400 Mb/s.

For the best results, I maxed out the RX/TX ring buffers and the socket buffer sizes:

sudo ethtool -G eth1 rx 8192 tx 4096
echo 10485760 > /proc/sys/net/core/wmem_max
echo 10485760 > /proc/sys/net/core/rmem_max

There were no significant changes.
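
For reference, a minimal sketch of this kind of sender looks roughly like the following; the destination address, port, and loop count are placeholders, not the exact values from my setup:

/* Minimal UDP sender sketch: one 3 KB datagram per sendto() call.
 * Destination IP/port and iteration count are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PAYLOAD_SIZE 3072           /* 3K buffer, as in the test above */
#define DEST_IP      "192.168.1.10" /* placeholder: test PC address */
#define DEST_PORT    50000          /* placeholder: dynamic port on the PC */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(DEST_PORT);
    inet_pton(AF_INET, DEST_IP, &dst.sin_addr);

    static char buf[PAYLOAD_SIZE];
    memset(buf, 0xA5, sizeof(buf));

    /* Send datagrams as fast as the socket accepts them. */
    for (long i = 0; i < 1000000; i++) {
        if (sendto(sock, buf, sizeof(buf), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("sendto");
            break;
        }
    }

    close(sock);
    return 0;
}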

Also, as far as I can see, the CPU gets one interrupt per UDP packet. I have tried to enable interrupt moderation and DMA coalescing, but they are not supported:

ubuntu@ubuntu:~/sources/test_kern_udp$ sudo ethtool -C eth1
netlink error: Operation not supported

I checked the macb_main.c driver source, and it looks like these operations are not supported; the driver ALWAYS sends UDP data one packet per IRQ.

Is there any chance of getting 1G UDP Tx with the PolarFire SoC?

vfalanis commented 6 months ago

Hi @kolabit ,

We typically do benchmarks with standard tools, so we will provide some suggestions and benchmarks based on iperf. The issue you are describing appears to be application/CPU bound; you'll need to leverage multiple cores to maximise the throughput.

You'll need to use a benchmark tool that supports multi-threading, in this case iperf2 or iperf3 v3.16 and onwards. It is worth mentioning that if you intend to use iperf3, you'll need to update Yocto or Buildroot to a version of iperf3 that supports multithreading; please find attached a patch you can apply to Yocto to update iperf3 to version 3.16.
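
If you would rather generate the traffic from your own application than from iperf, the same idea applies: spread the send work across several threads, each with its own UDP socket. The following is only a rough sketch of that approach, not code we have benchmarked; the destination address, port, thread count, and payload size are placeholders:

/* Illustrative multi-threaded UDP sender: one socket per thread so the
 * send work can be spread across several cores. All constants are
 * placeholders. Build with -pthread. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_THREADS          4
#define PAYLOAD_SIZE         1400           /* below the MTU to avoid IP fragmentation */
#define DEST_IP              "192.168.1.10" /* placeholder */
#define DEST_PORT            5002           /* placeholder */
#define DATAGRAMS_PER_THREAD 1000000L

static void *sender(void *arg)
{
    (void)arg;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return NULL;
    }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(DEST_PORT);
    inet_pton(AF_INET, DEST_IP, &dst.sin_addr);

    char buf[PAYLOAD_SIZE] = { 0 };

    for (long i = 0; i < DATAGRAMS_PER_THREAD; i++) {
        if (sendto(sock, buf, sizeof(buf), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0) {
            perror("sendto");
            break;
        }
    }

    close(sock);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, sender, NULL);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

That said, for checking what the interface itself can do, iperf is usually the easier option.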

Please find below the instructions on how to run iperf to do UDP benchmarks:

On the server side:

$ iperf3 -s -p 5002

On the client side (Icicle Kit):

$ iperf3 -c <server_ip> -P 4 -u -p 5002 -b 1G

You should see something like this:

-----------------------------------------------------------
Server listening on 5002
-----------------------------------------------------------
Accepted connection from 10.205.160.81, port 48964
[  5] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 54212
[  6] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 34240
[  9] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 47235
[ 11] local 10.205.160.58 port 5002 connected to 10.205.160.81 port 49467
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec  34.0 MBytes   285 Mbits/sec  0.016 ms  0/24612 (0%)  
[  6]   0.00-1.00   sec  4.49 MBytes  37.6 Mbits/sec  0.083 ms  0/3250 (0%)  
[  9]   0.00-1.00   sec  34.9 MBytes   293 Mbits/sec  0.013 ms  0/25281 (0%)  
[ 11]   0.00-1.00   sec  30.2 MBytes   253 Mbits/sec  0.019 ms  0/21837 (0%)  
[SUM]   0.00-1.00   sec   104 MBytes   869 Mbits/sec  0.033 ms  0/74980 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec  35.5 MBytes   297 Mbits/sec  0.014 ms  0/25676 (0%)  
[  6]   1.00-2.00   sec  4.61 MBytes  38.7 Mbits/sec  0.057 ms  0/3337 (0%)  
[  9]   1.00-2.00   sec  36.5 MBytes   306 Mbits/sec  0.013 ms  0/26426 (0%)  
[ 11]   1.00-2.00   sec  31.5 MBytes   264 Mbits/sec  0.020 ms  0/22821 (0%)  
[SUM]   1.00-2.00   sec   108 MBytes   907 Mbits/sec  0.026 ms  0/78260 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec  35.5 MBytes   298 Mbits/sec  0.015 ms  0/25697 (0%)  
[  6]   2.00-3.00   sec  4.65 MBytes  39.0 Mbits/sec  0.092 ms  0/3366 (0%)  
[  9]   2.00-3.00   sec  36.4 MBytes   305 Mbits/sec  0.015 ms  0/26346 (0%)  
[ 11]   2.00-3.00   sec  31.5 MBytes   264 Mbits/sec  0.017 ms  0/22814 (0%)  
[SUM]   2.00-3.00   sec   108 MBytes   906 Mbits/sec  0.035 ms  0/78223 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec  35.4 MBytes   297 Mbits/sec  0.017 ms  0/25650 (0%)  
[  6]   3.00-4.00   sec  4.73 MBytes  39.7 Mbits/sec  0.079 ms  0/3428 (0%)  
[  9]   3.00-4.00   sec  35.9 MBytes   301 Mbits/sec  0.015 ms  0/25962 (0%)  
[ 11]   3.00-4.00   sec  31.5 MBytes   264 Mbits/sec  0.018 ms  0/22782 (0%)  
[SUM]   3.00-4.00   sec   107 MBytes   902 Mbits/sec  0.032 ms  0/77822 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec  35.4 MBytes   297 Mbits/sec  0.016 ms  0/25604 (0%)  
[  6]   4.00-5.00   sec  4.70 MBytes  39.4 Mbits/sec  0.085 ms  0/3400 (0%)  
[  9]   4.00-5.00   sec  36.0 MBytes   302 Mbits/sec  0.015 ms  0/26036 (0%)  
[ 11]   4.00-5.00   sec  31.4 MBytes   263 Mbits/sec  0.019 ms  0/22735 (0%)  
[SUM]   4.00-5.00   sec   107 MBytes   901 Mbits/sec  0.034 ms  0/77775 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec  35.3 MBytes   296 Mbits/sec  0.014 ms  0/25574 (0%)  
[  6]   5.00-6.00   sec  4.79 MBytes  40.2 Mbits/sec  0.064 ms  0/3467 (0%)  
[  9]   5.00-6.00   sec  35.8 MBytes   300 Mbits/sec  0.012 ms  0/25905 (0%)  
[ 11]   5.00-6.00   sec  31.3 MBytes   263 Mbits/sec  0.022 ms  0/22698 (0%)  
[SUM]   5.00-6.00   sec   107 MBytes   899 Mbits/sec  0.028 ms  0/77644 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-7.00   sec  35.2 MBytes   295 Mbits/sec  0.016 ms  0/25466 (0%)  
[  6]   6.00-7.00   sec  4.95 MBytes  41.5 Mbits/sec  0.048 ms  0/3584 (0%)  
[  9]   6.00-7.00   sec  35.8 MBytes   300 Mbits/sec  0.015 ms  0/25890 (0%)  
[ 11]   6.00-7.00   sec  31.0 MBytes   260 Mbits/sec  0.019 ms  0/22477 (0%)  
[SUM]   6.00-7.00   sec   107 MBytes   897 Mbits/sec  0.024 ms  0/77417 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   7.00-8.00   sec  35.2 MBytes   295 Mbits/sec  0.014 ms  0/25489 (0%)  
[  6]   7.00-8.00   sec  4.90 MBytes  41.1 Mbits/sec  0.077 ms  0/3548 (0%)  
[  9]   7.00-8.00   sec  35.9 MBytes   301 Mbits/sec  0.015 ms  0/25962 (0%)  
[ 11]   7.00-8.00   sec  31.1 MBytes   261 Mbits/sec  0.021 ms  0/22543 (0%)  
[SUM]   7.00-8.00   sec   107 MBytes   898 Mbits/sec  0.032 ms  0/77542 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec  35.2 MBytes   296 Mbits/sec  0.012 ms  0/25513 (0%)  
[  6]   8.00-9.00   sec  4.89 MBytes  41.0 Mbits/sec  0.063 ms  0/3541 (0%)  
[  9]   8.00-9.00   sec  36.0 MBytes   302 Mbits/sec  0.018 ms  0/26098 (0%)  
[ 11]   8.00-9.00   sec  31.1 MBytes   261 Mbits/sec  0.018 ms  0/22529 (0%)  
[SUM]   8.00-9.00   sec   107 MBytes   900 Mbits/sec  0.028 ms  0/77681 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  35.2 MBytes   296 Mbits/sec  0.015 ms  0/25510 (0%)  
[  6]   9.00-10.00  sec  4.90 MBytes  41.1 Mbits/sec  0.049 ms  0/3551 (0%)  
[  9]   9.00-10.00  sec  36.1 MBytes   303 Mbits/sec  0.015 ms  0/26139 (0%)  
[ 11]   9.00-10.00  sec  31.1 MBytes   261 Mbits/sec  0.014 ms  0/22548 (0%)  
[SUM]   9.00-10.00  sec   107 MBytes   901 Mbits/sec  0.023 ms  0/77748 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  10.00-10.04  sec  1.42 MBytes   283 Mbits/sec  0.018 ms  0/1029 (0%)  
[  6]  10.00-10.04  sec   208 KBytes  40.4 Mbits/sec  0.064 ms  0/147 (0%)  
[  9]  10.00-10.04  sec  1.46 MBytes   291 Mbits/sec  0.014 ms  0/1058 (0%)  
[ 11]  10.00-10.04  sec  1.26 MBytes   251 Mbits/sec  0.015 ms  0/912 (0%)  
[SUM]  10.00-10.04  sec  4.34 MBytes   865 Mbits/sec  0.028 ms  0/3146 (0%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.04  sec   353 MBytes   295 Mbits/sec  0.018 ms  0/255820 (0%)  receiver
[  6]   0.00-10.04  sec  47.8 MBytes  39.9 Mbits/sec  0.064 ms  0/34619 (0%)  receiver
[  9]   0.00-10.04  sec   361 MBytes   301 Mbits/sec  0.014 ms  0/261103 (0%)  receiver
[ 11]   0.00-10.04  sec   313 MBytes   262 Mbits/sec  0.015 ms  0/226696 (0%)  receiver
[SUM]   0.00-10.04  sec  1.05 GBytes   898 Mbits/sec  0.028 ms  0/778238 (0%)  receiver

As shown in the last line, the average bitrate was 898 Mbits/sec when using 4 parallel streams, which is very close to the maximum theoretical throughput: with the ~1448-byte datagrams used above, UDP, IP and Ethernet framing overhead limits the achievable goodput on a 1 Gbit/s link to roughly 956 Mbits/sec.

0001-add-local-version-of-iperf-316.patch

kolabit commented 6 months ago

Hi @vfalanis, I don't think this approach is acceptable. In this case we would have 4 CPUs fully busy sending a 1G UDP stream (!!!), which would deprive other tasks of CPU time. On top of that, we would need to implement sending the data from multiple CPUs.

Without real interrupt moderation in UDP, this platform is unusable.

vfalanis commented 5 months ago

Hi @kolabit ,

This is probably less a question of interrupt moderation and more to do with being CPU bound and differences in the UDP and TCP paths through the kernel networking stack.

The quad U54 cores on PolarFire SoC each offer a max clock speed of 625 MHz and 1.7 DMIPS/MHz.

By comparison, a dual-core Arm Cortex-A9 offers 2.5 DMIPS/MHz and runs at up to 1 GHz.

Therefore, two U54s are roughly equivalent to the CPU performance of one A9 core at 1 GHz (2 × 625 MHz × 1.7 DMIPS/MHz ≈ 2100 DMIPS, versus 1 GHz × 2.5 DMIPS/MHz = 2500 DMIPS per core).

Despite having a lower clock speed (625 MHz), the RISC-V CPU in PolarFire SoC offers reasonable performance, with ratings of 1.7 DMIPS/MHz and 2.75 CoreMark/MHz.

Utilizing two or three U54s lets your embedded system distribute computational tasks more effectively, potentially improving overall throughput and responsiveness. In this way the system can exploit parallelism and make better use of its resources, maximizing overall performance.
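
If the concern is that sender threads would starve other tasks of CPU time, one option, sketched below under the assumption of a Linux/glibc userspace, is to pin the senders to a subset of the harts and leave the remaining cores free for the rest of the system, for example with pthread_setaffinity_np() (or simply by launching the process under taskset). The CPU number used here is a placeholder:

/* Sketch: pin the calling thread to one specific CPU so the remaining
 * cores stay free for other work. Requires _GNU_SOURCE for
 * pthread_setaffinity_np(); build with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);   /* placeholder: restrict this thread to CPU 1 */

    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        fprintf(stderr, "failed to set CPU affinity\n");
        return 1;
    }

    /* ... run the UDP send loop here ... */
    return 0;
}

Each worker thread in a multi-threaded sender could do the same with its own CPU number before entering its send loop.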