ironcore-dev / dpservice

DPDK based fast Dataplane / L3 router / SDN enabler, installable on compute nodes / SmartNICs
Apache License 2.0

dp_service may crash with iperf3 high load when traffic is not offloaded #23

Closed guvenc closed 2 years ago

guvenc commented 2 years ago

dp_service may crash when iperf3 sends/receives non-offloaded traffic at more than 7 Gbits/sec. This is due to a bug in the DPDK Mellanox (mlx5) driver, documented here: https://bugs.dpdk.org/show_bug.cgi?id=334

We need to investigate whether the workaround provided by Mellanox can be used.

guvenc commented 2 years ago

The Mellanox bug report mentions a workaround: "The option we are considering - explicitly disable the CQE compression if rx_burst_vec is engaged"

So I tried to start dp_service by passing the following parameters to the mlx5 DPDK driver:

dp_service -l 0,1 -a 03:00.0,class=rxq_cqe_comp_en=0,rx_vec_en=1 -a 03:00.1,class=rxq_cqe_comp_en=0,rx_vec_en=1 -- 

With these parameters the VFs are not created in dp_service, so this doesn't seem to help.

Relevant documentation:
https://doc.dpdk.org/guides/nics/mlx5.html#rx-burst-functions
https://doc.dpdk.org/guides/nics/mlx5.html#driver-options

guvenc commented 2 years ago

After a long debugging session in the DPDK mlx5 driver, I figured out how the workaround can be applied to our environment. ARM-based Bluefield-2 cards do not have real VFs in our use case, only a single VF-like representor port for the bare-metal server. This special "VF-like" representor port is indexed in the mlx5 driver as "-1", so the following needs to be added to the dp_service parameters:

-a 0000:03:00.0,class=rxq_cqe_comp_en=0,rx_vec_en=1,representor=pf[0]vf[-1]

For hypervisor use cases on Intel platforms (this line may change depending on the PCI bus and the number of VFs):

-a 0000:8a:00.0,class=rxq_cqe_comp_en=0,rx_vec_en=1,representor=pf[0]vf[1] -a 0000:8a:00.1,class=rxq_cqe_comp_en=0,rx_vec_en=1

After applying these parameters to the mlx5 driver, the crash was no longer observable under the same conditions.
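For anyone scripting the dp_service start-up, the device arguments above can be assembled programmatically. The following is only a minimal sketch: the helper name `mlx5_allow_arg` is hypothetical and not part of dp_service or DPDK, and the option strings are copied verbatim from the command lines in this comment. The PCI addresses and representor values are the ones from this environment and will differ elsewhere.

```python
def mlx5_allow_arg(pci_addr, representor=None):
    """Build one EAL `-a` device argument carrying the workaround options.

    Hypothetical helper for illustration only. The option strings are
    taken verbatim from the command lines quoted in this thread.
    """
    opts = ["class=rxq_cqe_comp_en=0", "rx_vec_en=1"]
    if representor is not None:
        # e.g. "pf[0]vf[-1]" for the VF-like representor port on Bluefield-2
        opts.append(f"representor={representor}")
    return ["-a", ",".join([pci_addr] + opts)]


# Bluefield-2 case: a single VF-like representor port, indexed as -1.
bluefield = mlx5_allow_arg("0000:03:00.0", "pf[0]vf[-1]")

# Hypervisor case: representor on the first port only, plain options on the second.
hypervisor = (
    mlx5_allow_arg("0000:8a:00.0", "pf[0]vf[1]")
    + mlx5_allow_arg("0000:8a:00.1")
)

print(" ".join(bluefield))
print(" ".join(hypervisor))
```

Returning a list of tokens rather than one string makes it easy to pass the result to `subprocess.run` together with the remaining dp_service arguments, without shell quoting issues around the square brackets.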

guvenc commented 2 years ago

Mellanox workaround implemented with #70. This cannot be fixed in dp-service.