openvswitch / ovs-issues

Issue tracker repo for Open vSwitch

After using the new version, there are errors with Intel's 82599 and E810 network cards, while Mellanox network cards are functioning normally. #321

Open wangjun0728 opened 6 months ago

wangjun0728 commented 6 months ago

The DPDK version is 22.11. Currently, it appears that the DPDK errors are occurring due to the new version's checksum offload support. Mellanox network cards seem to operate normally, while the E810 and 82599 network cards each display different error messages.

E810: {bus_info="bus_name=pci, vendor_id=8086, device_id=159b", driver_name=net_ice, if_descr="DPDK 22.11.1 net_ice", if_type="6", link_speed="25Gbps", max_hash_mac_addrs="0", max_mac_addrs="64", max_rx_pktlen="1618", max_rx_queues="256", max_tx_queues="256", max_vfs="0", max_vmdq_pools="0", min_rx_bufsize="1024", n_rxq="2", n_txq="5", numa_id="1", port_no="1", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="true", tx_out_udp_csum_offload="true", tx_sctp_csum_offload="true", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}

error:
2024-03-04T10:57:01.102Z|00018|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event
2024-03-04T10:57:01.105Z|00019|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event
2024-03-04T10:57:01.113Z|00020|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event
2024-03-04T10:57:01.167Z|00021|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event
2024-03-04T10:57:01.278Z|00022|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event
2024-03-04T10:57:01.599Z|00023|dpdk|WARN|ice_interrupt_handler(): OICR: MDD event

82599: {bus_info="bus_name=pci, vendor_id=8086, device_id=10fb", driver_name=net_ixgbe, if_descr="DPDK 22.11.1 net_ixgbe", if_type="6", link_speed="10Gbps", max_hash_mac_addrs="4096", max_mac_addrs="127", max_rx_pktlen="1618", max_rx_queues="128", max_tx_queues="64", max_vfs="0", max_vmdq_pools="64", min_rx_bufsize="1024", n_rxq="2", n_txq="5", numa_id="0", port_no="1", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="false", tx_out_udp_csum_offload="false", tx_sctp_csum_offload="true", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}

error:
2024-03-04T11:04:52.740Z|00384|netdev_dpdk|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
2024-03-04T11:04:54.449Z|00385|netdev_dpdk|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
2024-03-04T11:04:55.492Z|00386|netdev_dpdk|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
2024-03-04T11:04:55.592Z|00387|netdev_dpdk|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
2024-03-04T11:04:56.644Z|00388|netdev_dpdk|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported

mellanox: {bus_info="bus_name=pci, vendor_id=15b3, device_id=1017", driver_name=mlx5_pci, if_descr="DPDK 22.11.1 mlx5_pci", if_type="6", link_speed="25Gbps", max_hash_mac_addrs="0", max_mac_addrs="128", max_rx_pktlen="1618", max_rx_queues="1024", max_tx_queues="1024", max_vfs="0", max_vmdq_pools="0", min_rx_bufsize="32", n_rxq="2", n_txq="5", numa_id="3", port_no="1", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="true", tx_out_udp_csum_offload="false", tx_sctp_csum_offload="false", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}

igsilya commented 5 months ago

@wangjun0728 thanks for the info. Though, since you're using the lb_output action, you should not experience the same issue as in that thread.

wangjun0728 commented 5 months ago

@igsilya When TSO is enabled, I captured the TCP exchange at the receiving end, but the TCP traffic couldn't be pushed up the stack. It seems the TCP segments are being miscalculated during the exchange. Perhaps this information is somewhat useful. 10.0.0.3 (send) --------> 10.0.0.5 (receive)

[screenshot: TCP packet capture at the receiving end]

wangjun0728 commented 5 months ago

Regarding the abnormal TCP forwarding after enabling TSO, I still have doubts, so I added some debug prints in netdev_send(). The results for the physical and vhost interfaces are below. Since dp_packet_hwol_is_tso() always returns 0, and the condition !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO) is used here, we never enter this path after enabling TSO. That is why I never see the netdev_geneve_tso_drops statistic increase. I am not clear on the reason for this condition.

https://github.com/openvswitch/ovs/blob/master/lib/netdev.c#L913

  vhost-user-client: userspace_tso_enabled():1,
                                netdev_flags:0x1f,
                                dp_packet_hwol_is_tso():0,
                                dp_packet_hwol_is_tunnel_vxlan():0,
                                dp_packet_hwol_is_tunnel_geneve():0

  tun_port_p0: userspace_tso_enabled():1,
               netdev_flags:0x9f,
               dp_packet_hwol_is_tso():0,
               dp_packet_hwol_is_tunnel_vxlan():0,
               dp_packet_hwol_is_tunnel_geneve():1
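The gating logic described above can be sketched as follows. This is a simplified stand-in, not the actual OVS code: the flag value and helper signature here are hypothetical placeholders (the real flag lives in OVS's netdev headers), used only to illustrate why the path is skipped when the packet is not marked for TSO or the egress netdev advertises TSO support.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placeholder value for the netdev TSO feature bit; the real
 * bit position is defined in the OVS netdev provider headers. */
#define NETDEV_TX_OFFLOAD_TCP_TSO (1 << 4)

/* Sketch of the check discussed above: the software fallback/drop path is
 * entered only when userspace TSO is on, the packet is actually marked for
 * TSO, and the egress netdev lacks hardware TSO support. With
 * dp_packet_hwol_is_tso() returning 0 (pkt_marked_tso == false), the path
 * is never taken, matching the debug output in the comment. */
static bool
needs_sw_tso_fallback(bool userspace_tso_enabled, uint64_t netdev_flags,
                      bool pkt_marked_tso)
{
    return userspace_tso_enabled
           && pkt_marked_tso
           && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO);
}
```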
igsilya commented 5 months ago

So, the packet is not marked for TSO. I wonder if it's just an MTU issue. What are the MTU values configured inside the VM and on the physical ports in OVS?

wangjun0728 commented 5 months ago

The VM and vhost-user-client MTU is 1500; the physical ports MTU is 1600.

david-marchand commented 5 months ago

From what I can see in the DPDK code, it appears that the i40e driver does not handle the outer UDP checksum logic. https://github.com/DPDK/dpdk/blob/main/drivers/net/i40e/i40e_rxtx.c#L301

Outer checksum is handled while filling the tunnel parts of the tx descriptor. https://github.com/DPDK/dpdk/blob/main/drivers/net/i40e/i40e_rxtx.c?commit=v24.03-rc3#L253

I don't have an i40e NIC available and I am testing on an ice NIC. But seeing how close the drivers are, I would expect the issues to be shared. I posted a fix on the mailing list.

wangjun0728 commented 5 months ago

@david-marchand Hi, I have validated your V2 version, but unfortunately, with the X710 NIC the outer UDP checksum is still incorrect. I also tested with the 82599 and CX5 NICs, and those appear to work fine. I believe this is directly related to the DPDK driver not actually supporting the offload despite advertising the capability. https://patchwork.ozlabs.org/project/openvswitch/patch/20240328091537.1467676-1-david.marchand@redhat.com/

  14:02:44.226059 6c:fe:54:2f:7e:b0 > 40:a6:b7:21:92:8c, ethertype 802.1Q (0x8100), length 128: vlan 92, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110)
      10.253.38.55.33869 > 10.253.38.54.geneve: [bad udp cksum 0xffff -> 0x8ed5!] Geneve, Flags [C], vni 0x31, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00050006]
      0e:a0:1b:9e:ca:04 > 0a:c8:e1:5c:84:0e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 52622, offset 0, flags [none], proto TCP (6), length 52)
      10.0.0.3.48990 > 10.0.0.5.targus-getdata1: Flags [S], cksum 0x4a82 (correct), seq 3476910759, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
  14:02:45.226847 6c:fe:54:2f:7e:b0 > 40:a6:b7:21:92:8c, ethertype 802.1Q (0x8100), length 128: vlan 92, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110)
      10.253.38.55.33869 > 10.253.38.54.geneve: [bad udp cksum 0xffff -> 0x8ed5!] Geneve, Flags [C], vni 0x31, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00050006]
      0e:a0:1b:9e:ca:04 > 0a:c8:e1:5c:84:0e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 52623, offset 0, flags [none], proto TCP (6), length 52)
      10.0.0.3.48990 > 10.0.0.5.targus-getdata1: Flags [S], cksum 0x4a82 (correct), seq 3476910759, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0

Below are the status of different network cards.

X710:

  {bus_info="bus_name=pci, vendor_id=8086, device_id=1572", driver_name=net_i40e, if_descr="DPDK 23.11.0 net_i40e", if_type="6", link_speed="10Gbps", max_hash_mac_addrs="0", max_mac_addrs="64", max_rx_pktlen="1618", max_rx_queues="320", max_tx_queues="320", max_vfs="0", max_vmdq_pools="64", min_rx_bufsize="1024", n_rxq="2", n_txq="5", numa_id="0", port_no="0", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="true", tx_out_udp_csum_offload="true", tx_sctp_csum_offload="true", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}

CX5:

  {bus_info="bus_name=pci, vendor_id=15b3, device_id=1017", driver_name=mlx5_pci, if_descr="DPDK 23.11.0 mlx5_pci", if_type="6", link_speed="25Gbps", max_hash_mac_addrs="0", max_mac_addrs="128", max_rx_pktlen="1618", max_rx_queues="1024", max_tx_queues="1024", max_vfs="0", max_vmdq_pools="0", min_rx_bufsize="32", n_rxq="2", n_txq="5", numa_id="3", port_no="1", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="true", tx_out_udp_csum_offload="false", tx_sctp_csum_offload="false", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}

82599:

  {bus_info="bus_name=pci, vendor_id=8086, device_id=10fb", driver_name=net_ixgbe, if_descr="DPDK 23.11.0 net_ixgbe", if_type="6", link_speed="10Gbps", max_hash_mac_addrs="4096", max_mac_addrs="127", max_rx_pktlen="1618", max_rx_queues="128", max_tx_queues="64", max_vfs="0", max_vmdq_pools="64", min_rx_bufsize="1024", n_rxq="2", n_txq="5", numa_id="0", port_no="1", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="false", tx_ip_csum_offload="true", tx_out_ip_csum_offload="false", tx_out_udp_csum_offload="false", tx_sctp_csum_offload="true", tx_tcp_csum_offload="true", tx_tcp_seg_offload="false", tx_udp_csum_offload="true", tx_vxlan_tso_offload="false"}
igsilya commented 5 months ago

@david-marchand I think I agree with @wangjun0728 on this one. The main indicator for me is that rte_net_intel_cksum_flags_prepare() doesn't check the outer UDP flag and also doesn't touch the checksum field in the outer UDP header. And we can see in the dump that it stays at the 0xffff value that OVS puts there.

One potential workaround would be for OVS to zero out the outer UDP checksum, but I don't think we should: an all-zero checksum may be interpreted as "no checksum" and simply never be calculated, even though users explicitly requested it to be calculated.
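The concern above follows from the UDP checksum rules in RFC 768: in the one's-complement scheme, a *computed* result of zero is transmitted as all ones (0xffff), while a literal 0x0000 in the header means "sender did not compute a checksum" for UDP over IPv4. A minimal sketch (pseudo-header omitted for brevity; not OVS code) of that finishing step:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal one's-complement checksum over a byte buffer, RFC 768/RFC 1071
 * style. The UDP pseudo-header contribution is omitted to keep the sketch
 * short; the point of interest is the final zero-to-0xffff substitution. */
static uint16_t
udp_csum_finish(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum 16-bit big-endian words. */
    for (size_t i = 0; i + 1 < len; i += 2) {
        sum += (uint32_t) data[i] << 8 | data[i + 1];
    }
    /* Pad a trailing odd byte with zero. */
    if (len & 1) {
        sum += (uint32_t) data[len - 1] << 8;
    }
    /* Fold carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }

    uint16_t csum = ~sum & 0xffff;
    /* RFC 768: a computed checksum of zero is transmitted as all ones,
     * because 0x0000 on the wire means "no checksum" for UDP/IPv4. This is
     * why zeroing the field is not equivalent to having it offloaded. */
    return csum ? csum : 0xffff;
}
```

So a field left at 0xffff by software, as seen in the tcpdump output above, is only valid if something later rewrites it with the real sum.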

wangjun0728 commented 5 months ago

From what I can see in the DPDK code, it appears that the i40e driver does not handle the outer UDP checksum logic. https://github.com/DPDK/dpdk/blob/main/drivers/net/i40e/i40e_rxtx.c#L301

Outer checksum is handled while filling the tunnel parts of the tx descriptor. https://github.com/DPDK/dpdk/blob/main/drivers/net/i40e/i40e_rxtx.c?commit=v24.03-rc3#L253

I don't have an i40e NIC available and I am testing on an ice NIC. But seeing how close the drivers are, I would expect the issues to be shared. I posted a fix on the mailing list.

Hi @david-marchand, regarding the DPDK driver implementations, I think you are right. However, I compared the ice and i40e implementations and found that neither of them implements outer UDP checksum offload. I did observe that the hns3 driver seems to implement it. I'm not sure if this information is helpful.

ice: https://github.com/DPDK/dpdk/blob/main/drivers/net/ice/ice_rxtx.c#L2702
i40e: https://github.com/DPDK/dpdk/blob/main/drivers/net/i40e/i40e_rxtx.c#L254

hns3: https://github.com/DPDK/dpdk/blob/main/drivers/net/hns3/hns3_rxtx.c#L3435

david-marchand commented 5 months ago

Hi @wangjun0728, yes, I noticed it yesterday after more digging and reading the X7xx and E8xx datasheets.

Looking at the DPDK history, a faulty fix introduced the outer UDP checksum capability in net/i40e (see 8cc79a1636cd ("net/i40e: fix forward outer IPv6 VXLAN")).

I found some bits in the i40e base driver for the X722 model that may support outer UDP checksum. I pasted a proof-of-principle patch for DPDK below. It is incomplete, as I suspect the outer UDP checksum support should be adjusted per model in the net/i40e DPDK driver.

Maybe you can have a try with your setup if you have such NICs (on a different server maybe, as I see PCI IDs for an XL710 in your traces)? If this still does not work, we will put all the details in the DPDK bugzilla and let Intel handle the issue (and the DPDK commit 8cc79a1636cd should probably be reverted).

On the OVS side, we still need the fix I sent for net/ice at least.

Tentative fix for dpdk:


diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 5d25ab4d3a..a385444982 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -295,6 +295,15 @@ i40e_parse_tunneling_params(uint64_t ol_flags,
         */
        *cd_tunneling |= (tx_offload.l2_len >> 1) <<
                I40E_TXD_CTX_QW0_NATLEN_SHIFT;
+
+       /**
+        * Calculate the tunneling UDP checksum.
+        * Shall be set only if L4TUNT = 01b and EIPT is not zero
+        */
+       if (!(*cd_tunneling & I40E_TX_CTX_EXT_IP_NONE) &&
+               (*cd_tunneling & I40E_TXD_CTX_UDP_TUNNELING) &&
+               (ol_flags & RTE_MBUF_F_TX_OUTER_UDP_CKSUM))
+               *cd_tunneling |= I40E_TXD_CTX_QW0_L4T_CS_MASK;
 }

 static inline void
wangjun0728 commented 5 months ago

@david-marchand Thank you very much for the modifications. I applied your DPDK i40e change, with OVS using the V2 modifications. However, I still encounter an incorrect outer UDP checksum. After reviewing your change, I think the reason might be that tx_geneve_tso_offload/tx_vxlan_tso_offload are false on the X710: while they are false, your change should have no effect.

  09:07:58.619283 6c:fe:54:2f:7e:b0 > 40:a6:b7:21:92:8c, ethertype 802.1Q (0x8100), length 128: vlan 92, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110)
      10.253.38.55.47627 > 10.253.38.54.geneve: [bad udp cksum 0xffff -> 0x5917!] Geneve, Flags [C], vni 0x31, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00050006]
          0e:a0:1b:9e:ca:04 > 0a:c8:e1:5c:84:0e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 48293, offset 0, flags [none], proto TCP (6), length 52)
      10.0.0.3.33372 > 10.0.0.5.targus-getdata1: Flags [S], cksum 0x76c1 (correct), seq 548024830, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
wangjun0728 commented 5 months ago

After attempting to enable TSO, tx_geneve_tso_offload/tx_vxlan_tso_offload are already set to true. However, upon packet capture, the outer UDP checksum still appears to be incorrect.

  {bus_info="bus_name=pci, vendor_id=8086, device_id=1572", driver_name=net_i40e, if_descr="DPDK 23.11.0 net_i40e", if_type="6", link_speed="10Gbps", max_hash_mac_addrs="0", max_mac_addrs="64", max_rx_pktlen="1618", max_rx_queues="320", max_tx_queues="320", max_vfs="0", max_vmdq_pools="64", min_rx_bufsize="1024", n_rxq="2", n_txq="5", numa_id="0", port_no="0", rx-steering=rss, rx_csum_offload="true", tx_geneve_tso_offload="true", tx_ip_csum_offload="true", tx_out_ip_csum_offload="true", tx_out_udp_csum_offload="true", tx_sctp_csum_offload="true", tx_tcp_csum_offload="true", tx_tcp_seg_offload="true", tx_udp_csum_offload="true", tx_vxlan_tso_offload="true"}

  09:23:29.649394 6c:fe:54:2f:7e:b0 > 40:a6:b7:21:92:8c, ethertype 802.1Q (0x8100), length 128: vlan 92, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110)
      10.253.38.55.51191 > 10.253.38.54.geneve: [bad udp cksum 0xffff -> 0x4b2b!] Geneve, Flags [C], vni 0x31, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00050006]
          0e:a0:1b:9e:ca:04 > 0a:c8:e1:5c:84:0e, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 27195, offset 0, flags [none], proto TCP (6), length 52)
      10.0.0.3.48558 > 10.0.0.5.targus-getdata1: Flags [S], cksum 0xec33 (correct), seq 68066773, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
wangjun0728 commented 5 months ago

@david-marchand According to the Intel(R) Ethernet Controller X710/XXV710/XL710 Datasheet, section 8.4.4.2, "Tunneling UDP headers and GRE header are not offloaded while the X710/XXV710/XL710 leaves their checksum field as is". So for the X710 series, DPDK should not advertise support for outer UDP checksum offload; this appears to be a bug in DPDK.

https://cdrdv2-public.intel.com/332464/332464_710_Series_Datasheet_v_4_1.pdf

According to the Intel(R) Ethernet Controller E810 Datasheet, section 10.5.8.3, the tunneling UDP checksum offload appears to be supported.

https://www.intel.com/content/www/us/en/content-details/613875/intel-ethernet-controller-e810-datasheet.html

david-marchand commented 5 months ago

@wangjun0728 yes, I had found out about this difference reading the datasheets.

I spent some time on the topic. Could you have a try with this OVS branch of mine? https://github.com/david-marchand/ovs/commits/tunnel_offloading_fix

wangjun0728 commented 5 months ago

@david-marchand Thank you, I have completed verification of your modifications, including the change in the DPDK bugzilla and the OVS branch. I believe they are correct.

https://github.com/david-marchand/ovs/commits/tunnel_offloading_fix
https://bugs.dpdk.org/show_bug.cgi?id=1406

1. First, regarding the X710: as discussed in the DPDK bugzilla, it indeed does not support outer UDP checksum offloading, but DPDK falsely advertises support for it. It therefore needs to be disabled on the OVS side. I tried enabling it in your version, which resulted in outer checksum errors, so we will keep it disabled.

2. Second, regarding the E810: I have validated your modifications, and they effectively resolve our issue. The outer checksum is correct, and I have observed that they also address the "ice_interrupt_handler(): OICR: MDD event" error message, which is excellent.

3. Furthermore, I have verified that the 82599 and Mellanox CX5 network cards function properly. However, I do not have a NIC using the net/iavf driver, so I am unable to verify that case. I apologize for the inconvenience.

4. Finally, one more issue: as I mentioned before, enabling userspace-tso-enable="true" results in very low throughput. I believe this is related to tx_tcp_seg_offload="true" and is likely a separate issue. The problem exists on 82599/X710/E810/CX5, suggesting it might be a bug on the OVS side.

CC @igsilya

david-marchand commented 5 months ago

iavf PCI devices are VFs of an E810 or an X710 NIC.

The PF PCI device must be bound to the kernel driver, and then you can create one VF.

Something like this (it works the same with an X710 NIC):

# ovs-vsctl del-port dpdk0
# systemctl stop openvswitch

# lspci | grep 04:.*Ethernet
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

# driverctl list-overrides
0000:04:00.0 vfio-pci
# driverctl unset-override 0000:04:00.0

# echo 1 > /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs

# lspci | grep 04:.*Ethernet
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:01.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)

# driverctl set-override 0000:04:01.0 vfio-pci
# driverctl list-overrides
0000:04:01.0 vfio-pci

### This part is optional, but, for quick testing, it helps to enable the trust mode and disable spoof check
# ip link set enp4s0f0 vf 0 spoofchk off trust on

After this, you can use the VF PCI device id as a netdev-dpdk port in OVS.

wangjun0728 commented 5 months ago

iavf PCI devices are VFs of an E810 or an X710 NIC.

The PF PCI device must be bound to the kernel driver, and then you can create one VF.

Something like this (it works the same with an X710 NIC):

# ovs-vsctl del-port dpdk0
# systemctl stop openvswitch

# lspci | grep 04:.*Ethernet
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

# driverctl list-overrides
0000:04:00.0 vfio-pci
# driverctl unset-override 0000:04:00.0

# echo 1 > /sys/bus/pci/devices/0000:04:00.0/sriov_numvfs

# lspci | grep 04:.*Ethernet
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
04:01.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)

# driverctl set-override 0000:04:01.0 vfio-pci
# driverctl list-overrides
0000:04:01.0 vfio-pci

### This part is optional, but, for quick testing, it helps to enable the trust mode and disable spoof check
# ip link set enp4s0f0 vf 0 spoofchk off trust on

After this, you can use the VF PCI device id as a netdev-dpdk port in OVS.

I understand what you mean, but this is a shared environment and I can't make significant changes. This involves rebinding new VF devices, and any change will have some impact. I'll look for an opportunity to validate the VF scenario later on.

david-marchand commented 5 months ago
wangjun0728 commented 5 months ago

Regarding the MTU, I believe it should be correct. The MTU for the VM and vhost-user-client is set to 1500, while the MTU for the physical ports is set to 1600. Since the Geneve encapsulation only adds 58 bytes, there should be no issue.

wangjun0728 commented 5 months ago
  • Ok, well about iavf, I tested with my E810 setup and it seems to work fine. People from Intel are supposed to test on the dpdk side; hopefully this should be enough.
  • Wrt your 4) point, I did a TSO test over a vxlan tunnel, and I got better throughput than without TSO. The only catch was that I had to update my MTU (which I decreased to 1400). Mmm, depending on your setup, you may have to double-check GRO (I tested with and without, btw) on the receiving host...?

Thanks for your reply. I tried disabling GRO as you suggested, but the problem still exists. I tried some other traffic paths and gathered more information. When I send traffic from the virtual machine to the physical network, the traffic is normal. But there is a problem when the path is virtual machine to virtual machine. I think this may be helpful, and it should be my next step of analysis. Also, sending UDP traffic works normally.

 1.        VM (node1) --------> node2 (physical network)    normal
 2.        VM (node1) --------> VM (node2)                  abnormal
david-marchand commented 5 months ago

Testing vm to vm (on a different setup with a IPv4 vxlan tunnel, with mlx5 nic) and setting mtu to 1600 on the dpdk physical port (and leaving all other mtu untouched at default 1500), I see in the vm:

# iperf3 -c 172.31.2.1 -t 1
Connecting to host 172.31.2.1, port 5201
[  5] local 172.31.2.2 port 50422 connected to 172.31.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  84.8 KBytes   694 Kbits/sec   25   1.41 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-1.00   sec  84.8 KBytes   694 Kbits/sec   25             sender
[  5]   0.00-1.04   sec  17.0 KBytes   134 Kbits/sec                  receiver

On the "receiving" node, packets are dropped in OVS with the following logs:

2024-04-08T13:34:00.969Z|00002|netdev_dpdk(pmd-c03/id:7)|WARN|vhost0: Too big size 1564 max_packet_len 1518
2024-04-08T13:34:00.969Z|00003|netdev_dpdk(pmd-c03/id:7)|WARN|vhost0: Too big size 1564 max_packet_len 1518
2024-04-08T13:34:00.969Z|00004|netdev_dpdk(pmd-c03/id:7)|WARN|vhost0: Too big size 1564 max_packet_len 1518

This is because the NIC segments packets against the 1600 MTU. On the receiving OVS side, those packets are too large from the vhost port's point of view. Adjusting the physical MTU to exactly 1546 (to accommodate the vxlan tunnel) resolves this for me.

Hope that helps.

# iperf3 -c 172.31.2.1 -t 1
Connecting to host 172.31.2.1, port 5201
[  5] local 172.31.2.2 port 33648 connected to 172.31.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.08 GBytes  9.30 Gbits/sec  226   2.71 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-1.00   sec  1.08 GBytes  9.30 Gbits/sec  226             sender
[  5]   0.00-1.04   sec  1.08 GBytes  8.93 Gbits/sec                  receiver
wangjun0728 commented 5 months ago

@david-marchand Yes, you are absolutely right. I adjusted the corresponding MTU (1558) as you said, and it did solve my problem; testing on the E810 network card works perfectly, which is fantastic. But I have a doubt: here the physical port can only be set to 1500 + 58 (the overlay length?). If the MTU of the vhost-user port changes, does the MTU of the physical port also need to change?

    # iperf3 -c 10.0.0.5
    Connecting to host 10.0.0.5, port 5201
    [  5] local 10.0.0.3 port 54740 connected to 10.0.0.5 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  1.04 GBytes  8.97 Gbits/sec  2451    552 KBytes       
    [  5]   1.00-2.00   sec  1.04 GBytes  8.92 Gbits/sec  2565    801 KBytes       
    [  5]   2.00-3.00   sec  1.05 GBytes  9.02 Gbits/sec  1558   1.03 MBytes       
    [  5]   3.00-4.00   sec  1.05 GBytes  8.99 Gbits/sec  2229   1.32 MBytes       
    [  5]   4.00-5.00   sec  1.06 GBytes  9.12 Gbits/sec    0   1.83 MBytes 
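The 1558 figure above matches the inner MTU plus the per-header Geneve encapsulation overhead. A sketch of that arithmetic, assuming IPv4 and the single 8-byte OVN Geneve option visible in the tcpdump output earlier in this thread (option sizes vary, so this total is specific to that capture):

```c
/* Per-header sizes for an IPv4 Geneve tunnel, matching the captures in
 * this thread. The outer Ethernet header is not counted: a port's MTU
 * excludes its own link-layer header. */
enum {
    INNER_ETH_HDR  = 14, /* inner Ethernet frame carried inside the tunnel */
    OUTER_IPV4_HDR = 20,
    OUTER_UDP_HDR  = 8,
    GENEVE_HDR     = 8,  /* base Geneve header */
    GENEVE_OPT     = 8,  /* the 8-byte OVN option seen in the tcpdump */
};

/* Physical-port MTU needed so a full-sized inner packet still fits after
 * encapsulation. */
static int
geneve_phys_mtu(int inner_mtu)
{
    return inner_mtu + INNER_ETH_HDR + OUTER_IPV4_HDR + OUTER_UDP_HDR
           + GENEVE_HDR + GENEVE_OPT;
}
```

With a 1500-byte inner MTU this yields 1558, the value that resolved the drops; larger Geneve option sets would push the requirement higher.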
david-marchand commented 5 months ago

If you modify some interface's MTU, then the MTU configured on all the ports in your OVS bridges and guests must be reevaluated.

There is no automatic adjustment if the MTU is changed in the guest: neither on the OVS physical ports, nor on the OVS vhost-user ports associated to this guest.

igsilya commented 5 months ago

Uff, that's annoying. But I agree that there is no easy way out from this situation. MTU of the physical port should match exactly the MTU of other ports + encapsulation overhead. I missed that when I first asked about MTU. It also means that MTU of all the VM ports should likely be the same across the cluster. Otherwise two VMs on different nodes may not be able to talk to each other.

In a normal network we would have some sort of path MTU discovery, i.e. interfaces would generate ICMP errors back to the sender when they can't transmit a packet of a certain size, and the sender would install a routing exception to send smaller packets to that particular destination. But with DPDK and the userspace datapath we don't have that.

igsilya commented 5 months ago

Wait a second... Why doesn't tso_segsz solve this issue for us? If we receive a TSO packet from vhost-user it should have TSO segment size set according to 1500 MTU of the vhost-user interface. And later the physical NIC should segment the packet according to this segment size value. So, segments received on the other side should be suitable for transmission to the 1500 MTU vhost-user port.
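The expectation in the comment above can be illustrated with a small sketch (illustrative arithmetic only, not OVS or DPDK code): if the NIC honors the packet's TSO segment size, each emitted frame is bounded by the headers plus tso_segsz, independent of the physical port's larger MTU.

```c
/* If TSO segmentation honors tso_segsz, each wire frame carries at most
 * tso_segsz payload bytes plus the (replicated) headers, so the physical
 * port's larger MTU should not matter for the resulting segment sizes. */
static int
max_segment_frame_len(int hdr_len, int tso_segsz)
{
    return hdr_len + tso_segsz;
}

/* Number of segments a TSO payload is split into (ceiling division). */
static int
num_segments(int payload_len, int tso_segsz)
{
    return (payload_len + tso_segsz - 1) / tso_segsz;
}
```

E.g. a segment size derived from a 1500-byte vhost MTU should keep every resulting frame deliverable to a 1500-MTU vhost port, which is why the observed oversized packets suggest the segment size is not being applied as expected somewhere along the path.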

wangjun0728 commented 5 months ago

Wait a second... Why doesn't tso_segsz solve this issue for us? If we receive a TSO packet from vhost-user it should have TSO segment size set according to 1500 MTU of the vhost-user interface. And later the physical NIC should segment the packet according to this segment size value. So, segments received on the other side should be suitable for transmission to the 1500 MTU vhost-user port.

Yes, I completely agree with this point. In practical usage, Geneve options may have variable lengths, so strictly matching the MTU could pose issues. My understanding is that the physical port's MTU >= encapsulation length + vhost-user MTU should be sufficient: in kernel scenarios such usage works normally, and strictly enforcing physical port MTU = encapsulation length + vhost-user MTU seems unreasonable. Therefore, I suspect there may be a bug in the tso_segsz logic in this context.

wangjun0728 commented 5 months ago

@wangjun0728 yes, I had found out about this difference reading the datasheets.

I spent some time on the topic. Could you have a try with this OVS branch of mine? https://github.com/david-marchand/ovs/commits/tunnel_offloading_fix

@david-marchand Additionally, with the OVS modification I'm using, I found that when TSO is enabled, the 82599 network card generates a large number of errors when sending TCP packets. The E810 works fine. I will try rolling back this modification to see if it makes a difference.

  2024-04-09T02:22:00.969Z|00004|netdev_dpdk(pmd-c02/id:92)|WARN|Dropped 160 log messages in last 274 seconds (most recently, 269 seconds ago) due to excessive rate
  2024-04-09T02:22:00.969Z|00005|netdev_dpdk(pmd-c02/id:92)|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
  2024-04-09T02:22:00.972Z|00006|netdev_dpdk(pmd-c02/id:92)|DBG|tun_port_p1: First invalid packet:
  dump mbuf at 0x19539fac0, iova=0x18f2b6b40, buf_len=7496
    pkt_len=7416, ol_flags=0x2884800000000182, nb_segs=1, port=65535, ptype=0
    segment at 0x19539fac0, data=0x18f2b6b82, len=7416, off=66, refcnt=1
    Dump data at [0x18f2b6b82], len=7416
  00000000: 40 A6 B7 21 92 8C 68 91 D0 65 C6 C3 81 00 00 5C | @..!..h..e.....\
  00000010: 08 00 45 00 1C E6 00 00 40 00 40 11 BB 9F 0A FD | ..E.....@.@.....
  00000020: 26 38 0A FD 26 36 D4 B9 17 C1 1C D2 BA BF 02 40 | &8..&6.........@
  00000030: 65 58 00 00 31 00 01 02 80 01 00 04 00 06 0A C8 | eX..1...........
  00000040: E1 5C 84 0E 06 AF A9 F4 AA D6 08 00 45 00 1C AC | .\..........E...
  00000050: 0F 14 00 00 40 06 3B 2B 0A 00 00 09 0A 00 00 05 | ....@.;+........
  00000060: DA EE 14 51 03 06 B3 D6 18 43 BA BA 50 18 00 7E | ...Q.....C..P..~
  00000070: BD FC 00 00 0D 0B EF DE 52 E6 30 7A D1 0C AB B8 | ........R.0z....
  00000080: B9 D0 C3 9C 07 52 C1 4F E0 A3 92 62 02 8D B1 6B | .....R.O...b...k
  00000090: A5 FA 48 06 44 D4 4D A7 9F 39 5C A9 A2 79 D4 62 | ..H.D.M..9\..y.b
  000000A0: A7 54 A4 0B CD 1F 6D 1F 66 26 D5 31 A2 8E 70 37 | .T....m.f&.1..p7
  000000B0: 6F 34 29 FA F5 BE 0F 49 21 4C FE 03 F0 38 AA 06 | o4)....I!L...8..
  000000C0: C6 4B 3E 3B 19 36 9E 51 18 35 E0 D2 3A C1 14 39 | .K>;.6.Q.5..:..9
wangjun0728 commented 5 months ago

Rolling back your modification didn't resolve the issue; it seems that the 82599 network card doesn't support enabling TSO.

The same issue exists on the Mellanox CX5 network card. Even though the MTU has been adjusted to 1558, iperf cannot send a large number of TCP packets. However, unlike the 82599 network card, no similar errors have been observed with the Mellanox CX5 card.

So, does it mean that TSO cannot be enabled if the outer UDP checksum offload is not supported?

  2024-04-09T05:59:20.584Z|00003|netdev_dpdk(pmd-c02/id:88)|WARN|Dropped 302 log messages in last 67 seconds (most recently, 57 seconds ago) due to excessive rate
  2024-04-09T05:59:20.584Z|00004|netdev_dpdk(pmd-c02/id:88)|WARN|tun_port_p1: Output batch contains invalid packets. Only 0/1 are valid: Operation not supported
  2024-04-09T05:59:20.588Z|00005|netdev_dpdk(pmd-c02/id:88)|DBG|tun_port_p1: First invalid packet:
  dump mbuf at 0x1938b6e00, iova=0x18f261d80, buf_len=7496
    pkt_len=7416, ol_flags=0x2884800000000182, nb_segs=1, port=65535, ptype=0
    segment at 0x1938b6e00, data=0x18f261dc2, len=7416, off=66, refcnt=1
    Dump data at [0x18f261dc2], len=7416
  00000000: 40 A6 B7 21 92 8C 68 91 D0 65 C6 C3 81 00 00 5C | @..!..h..e.....\
  00000010: 08 00 45 00 1C E6 00 00 40 00 40 11 BB 9F 0A FD | ..E.....@.@.....
  00000020: 26 38 0A FD 26 36 DF 9B 17 C1 1C D2 AF DD 02 40 | &8..&6.........@
  00000030: 65 58 00 00 31 00 01 02 80 01 00 04 00 06 0A C8 | eX..1...........
  00000040: E1 5C 84 0E 06 AF A9 F4 AA D6 08 00 45 00 1C AC | .\..........E...
  00000050: A8 3A 00 00 40 06 A2 04 0A 00 00 09 0A 00 00 05 | .:..@...........
  00000060: 9C 44 14 51 06 83 A9 67 58 87 A7 38 50 18 00 7E | .D.Q...gX..8P..~
  00000070: AD E8 00 00 9D 23 90 5F 6F 71 50 F8 48 2D BD 13 | .....#._oqP.H-..
wangjun0728 commented 2 weeks ago

I use this patch from Mike, which solves the problem that my ovs-dpdk cannot enable TSO. So far I have verified that CX6/X710/E810, etc. all work normally. The performance of virtual machines on the same node is very high, and cross-node performance is still good (currently only the E810 supports this). And the MTU configuration restrictions that previously existed when TSO was enabled on the E810 network card no longer exist. Thanks again to @mkp-rh, @david-marchand and @igsilya.

https://patchwork.ozlabs.org/project/openvswitch/list/?series=417313