p4lang / behavioral-model

The reference P4 software switch
Apache License 2.0

Problems encountered when connecting two virtual machines using bmv2 #1280

Closed. git-liusen closed this issue 13 hours ago

git-liusen commented 4 days ago

Environment: three KVM virtual machines

         vm1---------------br1----------------------bmv2-vm0--------------------br2----------------vm2
        enp7s0         vnet1 vnet2           enp7s0--bmv2--enp8s0           vnet3 vnet4          enp7s0
   192.168.11.2/24                                                                           192.168.11.5/24

I have the following questions:

  1. In my tests, connecting through a Linux bridge works without turning off checksum offload. Why do I need to turn off checksum offload when using bmv2?
  2. Do the packets processed by bmv2 pass through the network protocol stack? Can bmv2 use dpdk?
  3. Why does communication work after turning off checksum offload, but with a large number of retransmitted packets and high network latency?

I see there is an existing issue about checksum offload (#1186). What should I do to get a normal network connection? I'm looking forward to your reply!

Below is my p4 code:

/* -*- P4_16 -*- */
#include <core.p4>
#include <v1model.p4>

const bit<16> TYPE_IPV4 = 0x800;
const bit<8>  TYPE_TCP  = 6;
const bit<8>  TYPE_UDP = 17;
const bit<32> I2E_CLONE_SESSION_ID = 100;

/*************************************************************************
*********************** H E A D E R S  ***********************************
*************************************************************************/

typedef bit<9>  egressSpec_t;
typedef bit<48> macAddr_t;
typedef bit<32> ip4Addr_t;

header ethernet_t {
    macAddr_t dstAddr;
    macAddr_t srcAddr;
    bit<16>   etherType;
}

header ipv4_t {
    bit<4>    version;
    bit<4>    ihl;
    bit<8>    diffserv;
    bit<16>   totalLen;
    bit<16>   identification;
    bit<3>    flags;
    bit<13>   fragOffset;
    bit<8>    ttl;
    bit<8>    protocol;
    bit<16>   hdrChecksum;
    ip4Addr_t srcAddr;
    ip4Addr_t dstAddr;
}

header tcp_t{
    bit<16> srcPort;
    bit<16> dstPort;
    bit<32> seqNo;
    bit<32> ackNo;
    bit<4>  dataOffset;
    bit<4>  res;
    bit<1>  cwr;
    bit<1>  ece;
    bit<1>  urg;
    bit<1>  ack;
    bit<1>  psh;
    bit<1>  rst;
    bit<1>  syn;
    bit<1>  fin;
    bit<16> window;
    bit<16> checksum;
    bit<16> urgentPtr;

}

header udp_t {
    bit<16> srcPort;
    bit<16> dstPort;
    bit<16> length;
    bit<16> checksum;
}

//**************************************************************

struct learn_t {
    bit<2> digest;
    bit<48> srcAddr;
    bit<9>  ingress_port;
}

struct metadata {
    learn_t learn;
}

//***************************************************************

struct headers {
    ethernet_t   ethernet;
    ipv4_t       ipv4;
}

/*************************************************************************
*********************** P A R S E R  ***********************************
*************************************************************************/

parser MyParser(packet_in packet,
                out headers hdr,
                inout metadata meta,
                inout standard_metadata_t standard_metadata) {

    state start {
        transition parse_ethernet;
    }

    state parse_ethernet {
        packet.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            TYPE_IPV4: parse_ipv4;
            default: accept;
        }
    }

    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        transition accept;
    }
}

/*************************************************************************
************   C H E C K S U M    V E R I F I C A T I O N   *************
*************************************************************************/

control MyVerifyChecksum(inout headers hdr, inout metadata meta) {
    apply {  }
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {

    action drop() {
        mark_to_drop(standard_metadata);
    }

    action mac_learn(){
        meta.learn.srcAddr = hdr.ethernet.srcAddr;
        meta.learn.ingress_port = standard_metadata.ingress_port;
        meta.learn.digest = 2;
        digest<learn_t>(1, meta.learn);
    }

    table smac {

        key = {
            hdr.ethernet.srcAddr: exact;
        }

        actions = {
            mac_learn;
            NoAction;
        }
        size = 256;
        default_action = mac_learn;
    }

    action forward(bit<9> egress_port) {
        standard_metadata.egress_spec = egress_port;
    }

    table dmac {
        key = {
            hdr.ethernet.dstAddr: exact;
        }

        actions = {
            forward;
            NoAction;
        }
        size = 256;
        default_action = NoAction;
    }

    action set_mcast_grp(bit<16> mcast_grp) {
        standard_metadata.mcast_grp = mcast_grp;
    }

    table broadcast {
        key = {
            standard_metadata.ingress_port: exact;
        }

        actions = {
            set_mcast_grp;
            NoAction;
        }
        size = 256;
        default_action = NoAction;
    }

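    apply {
        // Learn the source MAC: a miss in smac runs the default action
        // mac_learn, which sends a digest to the control plane.
        smac.apply();
        if (dmac.apply().hit){
            // Known destination: the matching entry's action
            // (normally forward) has set egress_spec.
        }
        else{
            // Unknown destination: flood via the multicast group
            // configured for this ingress port.
            broadcast.apply();
        }
    }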

}

/*************************************************************************
****************  E G R E S S   P R O C E S S I N G   *******************
*************************************************************************/

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    apply {

    }
}

/*************************************************************************
*************   C H E C K S U M    C O M P U T A T I O N   **************
*************************************************************************/

control MyComputeChecksum(inout headers  hdr, inout metadata meta) {
     apply {
        update_checksum(
            hdr.ipv4.isValid(),
            { hdr.ipv4.version,
              hdr.ipv4.ihl,
              hdr.ipv4.diffserv,
              hdr.ipv4.totalLen,
              hdr.ipv4.identification,
              hdr.ipv4.flags,
              hdr.ipv4.fragOffset,
              hdr.ipv4.ttl,
              hdr.ipv4.protocol,
              hdr.ipv4.srcAddr,
              hdr.ipv4.dstAddr },
            hdr.ipv4.hdrChecksum,
            HashAlgorithm.csum16);
    }
}

/*************************************************************************
***********************  D E P A R S E R  *******************************
*************************************************************************/

control MyDeparser(packet_out packet, in headers hdr) {
    apply {
        packet.emit(hdr.ethernet);
        packet.emit(hdr.ipv4);

    }
}

/*************************************************************************
***********************  S W I T C H  *******************************
*************************************************************************/

V1Switch(
MyParser(),
MyVerifyChecksum(),
MyIngress(),
MyEgress(),
MyComputeChecksum(),
MyDeparser()
) main;

jafingerhut commented 4 days ago

Others can probably provide more authoritative answers, but I believe that regarding the disabling of rx/tx checksum offload, the basic answer is as follows:

If a NIC driver tells the Linux kernel that rx and tx checksum offload are enabled, then the Linux kernel saves some CPU cycles while processing each packet, because the NIC driver is telling the kernel "you don't have to calculate these checksums, because the NIC will do them for you".

If a NIC driver tells the Linux kernel that rx and tx checksum offload are disabled, then the Linux kernel goes to the extra effort of calculating TCP and UDP checksums itself for each such packet. The extra computation is not terribly large -- it becomes most noticeable at higher network data rates, which should not be an issue in your testing.

I believe that with the virtual NICs used in the kind of setup that you have, e.g. veth pairs, the veth implementation does not implement these checksum offload features. So if the driver tells the Linux kernel that rx/tx offload are enabled, that is actually incorrect: they are not enabled. It is more truthful to disable them, so that the Linux kernel will calculate these checksums.

I do not know the reason for the high latency and retransmissions in your setup. Have you also tried disabling rx/tx checksum offload for the interfaces in the VM where the BMv2 simple_switch or simple_switch_grpc process is running?
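
To see what the driver currently reports on an interface, a small filter over the output of ethtool -k may help. This is only a sketch: it assumes the usual "feature: on|off [fixed]" line format printed by ethtool -k, the helper name offloads_on is made up for illustration, and enp7s0 is the interface name from the topology above.

```shell
#!/bin/sh
# Filter `ethtool -k` output (read from stdin) down to the top-level
# offload features that are currently reported as "on".
# offloads_on is a hypothetical helper name, not part of ethtool.
offloads_on() {
    awk -F': ' '
        /^(rx-checksumming|tx-checksumming|scatter-gather|tcp-segmentation-offload|generic-receive-offload|large-receive-offload):/ {
            split($2, state, " ")   # drop a trailing "[fixed]" marker
            if (state[1] == "on") print $1
        }'
}

# Usage on a live system (commented out; needs ethtool and the interface):
# ethtool -k enp7s0 | offloads_on
```

Running it in each VM, before and after changing settings, shows which offloads are still reported as enabled.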

antoninbas commented 4 days ago

I would recommend trying to disable scatter-gather (sg) on enp7s0 as well. After that you can try capturing the traffic at each interface to see if an issue shows up.
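
As a rough sketch of the suggestion above, the helper below only prints the ethtool commands so they can be reviewed first. The name offload_off_cmds is made up for illustration. rx, tx and sg are the short feature names accepted by ethtool -K, and some drivers may report a feature as fixed and refuse to change it.

```shell
#!/bin/sh
# Print the ethtool commands that would turn off rx/tx checksum offload and
# scatter-gather on each interface given as an argument. Nothing is changed
# until the output is reviewed and piped to a shell.
# offload_off_cmds is a hypothetical helper name.
offload_off_cmds() {
    for ifc in "$@"; do
        echo "sudo ethtool -K $ifc rx off tx off sg off"
    done
}

offload_off_cmds enp7s0
# To apply: offload_off_cmds enp7s0 | sh
```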

Do the packets processed by bmv2 pass through the network protocol stack? Can bmv2 use dpdk?

No and no

git-liusen commented 3 days ago

But I used a bridge instead of BMv2 in the virtual machine to connect ENP7S0 and ENP8S0 together, and they can communicate normally without turning off checksum offload, with no packet retransmissions. I tried disabling tx and rx checksum offload on the virtual machine where simple_switch_grpc is running, and also disabled sg, but it did not solve the problem.

The following figure shows the communication status when connected through a bridge

[image]

antoninbas commented 3 days ago

But I used a bridge instead of BMv2 in the virtual machine to connect ENP7S0 and ENP8S0 together

That's comparing apples to oranges. When you use a bridge, the traffic is handled by the Linux kernel. When you use bmv2, all packets are sent to a userspace process (simple_switch_grpc) using raw sockets.

jafingerhut commented 3 days ago

Antonin (or anyone reading this who knows), I know that there is a reliable way to see the full contents of any packet received or transmitted by the BMv2 software switch. Just add a command line option like this to the simple_switch or simple_switch_grpc command line: --dump-packet-data 10000 (the 10000 is the maximum number of bytes of each packet to print in the log).
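
A possible way to put that option on the command line is sketched below. The port numbers 0 and 1, the file name switch.json and the helper name bmv2_dump_cmd are placeholders for illustration, not taken from this issue. As far as I know the dump goes through the bmv2 logger, so logging (here --log-console) has to be enabled for anything to show up.

```shell
#!/bin/sh
# Build a simple_switch_grpc command line with packet dumping enabled.
# Assumptions: ports are numbered from 0 in argument order, and the JSON
# file name is whatever your setup already passes (switch.json is a
# placeholder). bmv2_dump_cmd is a hypothetical helper name.
bmv2_dump_cmd() {
    json=$1
    shift
    cmd="sudo simple_switch_grpc --log-console --dump-packet-data 10000"
    port=0
    for ifc in "$@"; do
        cmd="$cmd -i $port@$ifc"
        port=$((port + 1))
    done
    echo "$cmd $json"
}

bmv2_dump_cmd switch.json enp7s0 enp8s0
```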

I know you can use tcpdump or wireshark on veth interfaces to see packets going across them, but it is not clear to me when you do that whether the packet contents are shown before or after checksum calculations are done in the kernel (if they are done in the kernel at all, which they will not be if the NIC tx checksum offloading is enabled).

Having a reliable way to know the contents of the packet at multiple places along the path in a scenario like the one described in this issue would go a long way to understanding if checksumming is the problem.

Note: Even if the checksums of such packets are questionable, the presence or absence of packets shown by tcpdump/wireshark for a veth interface should be 100% accurate, at least when the packet rates are low enough that the CPU load is low.

antoninbas commented 3 days ago

Given this topology: enp7s0--bmv2--enp8s0, I would run a separate packet capture on both of these virtual interfaces to see if anything interesting shows up. The checksum settings shouldn't really matter on enp7s0 and enp8s0, given that bmv2 uses raw sockets.
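
For example, the two captures could be started roughly as follows, each in its own terminal on the bmv2 VM. This is a sketch: the output file names and the helper name capture_cmds are made up, and the helper only prints the tcpdump commands rather than running them.

```shell
#!/bin/sh
# Print one tcpdump command per interface. -nn disables name resolution,
# -e prints link-level headers, -s 0 captures whole packets, and -w saves
# them to a pcap file for later comparison.
# capture_cmds is a hypothetical helper name.
capture_cmds() {
    for ifc in "$@"; do
        echo "sudo tcpdump -i $ifc -nn -e -s 0 -w $ifc.pcap"
    done
}

capture_cmds enp7s0 enp8s0
```

Comparing the two pcap files afterwards shows which packets entered on one interface but never left on the other.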

git-liusen commented 13 hours ago

sudo ethtool -K enp7s0 gro off lro off
sudo ethtool -K enp8s0 gro off lro off

I found the cause of the packet retransmissions. With the NIC's receive offload features (GRO/LRO) enabled, received segments are merged into larger packets, so the packets received by the switch exceed the MTU and are dropped, resulting in retransmissions. The commands above turn these features off.
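
One way to check for such oversized frames is a capture filter on frame length. The sketch below assumes untagged Ethernet, where the 14-byte header is not counted in the MTU, so a 1500-byte MTU allows frames of up to 1514 bytes. The helper name oversize_filter is made up. The pcap primitive "greater N" matches frames of length N or more.

```shell
#!/bin/sh
# Print a pcap filter that matches frames longer than the MTU allows.
# Assumes a 14-byte Ethernet header and no VLAN tag; MTU defaults to 1500.
# oversize_filter is a hypothetical helper name.
oversize_filter() {
    mtu=${1:-1500}
    # "greater N" means length >= N, hence the extra +1.
    echo "greater $(( mtu + 14 + 1 ))"
}

oversize_filter 1500

# Usage on a live system (commented out; needs root and tcpdump):
# sudo tcpdump -i enp7s0 -nn -e "$(oversize_filter 1500)"
```

If frames still match after gro and lro are turned off, something else on the path is merging segments.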

git-liusen commented 13 hours ago

Thank you so much