networkservicemesh / cmd-forwarder-vpp

Enable AF_XDP for cmd-forwarder-vpp management interface #283

Closed edwarnicke closed 1 year ago

edwarnicke commented 3 years ago

Currently, cmd-forwarder-vpp uses AF_PACKET to bind to an existing Node interface using LinkToAfPacket.

AF_XDP is faster than AF_PACKET, but AF_XDP is only usable for our purposes from kernel version 5.4 onward. The good news is that lots of places have kernels that new (including more recent versions of Docker Desktop).

AF_XDP is supported in govpp

Because AF_XDP is only supported on newer kernels, a check will need to be made at startup and the correct method chosen (AF_XDP if available, otherwise AF_PACKET).
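For illustration, a minimal Go sketch of such a check (the function and package names here are assumptions, not the actual implementation) could parse the running kernel release and fall back to AF_PACKET when it is older than 5.4:

package uplink

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// afXDPSupported reports whether the running kernel is new enough (>= 5.4)
// to use AF_XDP for the uplink; callers would fall back to AF_PACKET otherwise.
func afXDPSupported() (bool, error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return false, err
	}
	release := unix.ByteSliceToString(uts.Release[:]) // e.g. "5.15.0-91-generic"
	var major, minor int
	if n, err := fmt.Sscanf(release, "%d.%d", &major, &minor); err != nil || n != 2 {
		return false, fmt.Errorf("cannot parse kernel release %q", release)
	}
	return major > 5 || (major == 5 && minor >= 4), nil
}

The result of such a check would then decide whether the uplink is created through VPP's af_xdp plugin or through af_packet (LinkToAfPacket).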

glazychev-art commented 3 years ago

blocked by https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/284

For some reason AF_XDP doesn't work correctly with VPP v20.09

glazychev-art commented 3 years ago

Found a problem on clusters - the forwarder just hangs during startup without any logs. I tested it on kind and on a Packet cluster - the situation is the same.

Created a JIRA issue - https://jira.fd.io/browse/VPP-1994

glazychev-art commented 1 year ago

It seems it has become clear why we see the forwarder (and node) hanging. If I understand correctly, AF_XDP moves frames directly to VPP, bypassing the Linux network stack. But we know that the forwarder uses hostNetwork: true - https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/forwarder-vpp/forwarder.yaml#L19. This is required for interdomain.

So, when VPP takes the uplink interface, it grabs the primary node interface, and traffic goes directly to VPP, bypassing Linux. Therefore we lose the connection to the node, and it appears to us that it hangs.

@edwarnicke As I see it, Calico-vpp has a similar scenario - https://projectcalico.docs.tigera.io/reference/vpp/host-network Should we take a similar approach?

denis-tingaikin commented 1 year ago

@glazychev-art

As I see it, Calico-vpp has a similar scenario - https://projectcalico.docs.tigera.io/reference/vpp/host-network Should we take a similar approach?

Could you please say more?

Also, as far as I know, AF_XDP is not working with Calico. Am I wrong?

edwarnicke commented 1 year ago

@glazychev-art Look into AF_XDP and eBPF. You should be able to craft an eBPF program that is passed in for AF_XDP that only passes on VXLAN/Wireguard/IPSEC packets (sort of like a pinhole), and then that traffic will go to VPP, and all other traffic will go to the kernel interface.

glazychev-art commented 1 year ago

Most likely the action plan will be:

glazychev-art commented 1 year ago

Current state:

  1. Prepared an eBPF program
  2. Built govpp with this patch - https://gerrit.fd.io/r/c/vpp/+/37274
  3. Ran cmd-forwarder-vpp docker tests - they work very well. They don't work without the patch from step 2
  4. Still have a problem with Kubernetes - forwarders are not responding after creation

There was an idea to update VPP to the latest version. Even with it, Docker tests don't work without https://gerrit.fd.io/r/c/vpp/+/37274, and the problem with Kubernetes was not resolved.

Perhaps the patch https://gerrit.fd.io/r/c/vpp/+/37274 is not entirely correct when we run the cluster locally (kind). I am continuing to work in this direction.

edwarnicke commented 1 year ago

@glazychev-art Is calico-vpp being on an older vpp version still blocking us from updating to a more recent vpp version?

glazychev-art commented 1 year ago

@edwarnicke Not really - it was updated recently (on main branch) - https://github.com/projectcalico/vpp-dataplane/commit/d8288e154cfb7d757e039d3a707d67ac4a0c5e49. I've tested this vpp revision and seen a few problems:

  1. Minor - we need to use a newer API version for AF_PACKET. For unknown reasons, our current one no longer works.
  2. More serious - with the many improvements to Wireguard in VPP, the event mechanism (by which we learn that the wireguard interface is ready) was broken. We will need to figure it out.
  3. We need to deal with ACLs, because our current usage returns an error.

Do we need to update?

edwarnicke commented 1 year ago

@glazychev-art It's probably a good idea to update, yes.

edwarnicke commented 1 year ago

@glazychev-art It might also be a good idea to put tests into VPP to prevent some of the breakage we are seeing from happening in the future.

glazychev-art commented 1 year ago

@edwarnicke I have a question related to the eBPF program. Currently I've implemented it so that it only filters IP/UDP packets based on the destination port.

But what do we do with ARP packets? We definitely need ARP packets to be handled by the kernel for the proper pod function. On the other hand, we also need ARP in the VPP so that we can find out the MAC addresses of other forwarders.

Perhaps we also need to filter frames by destination MAC, if the MACs differ for the VPP and kernel interfaces.

Do you have any thoughts?

edwarnicke commented 1 year ago

@glazychev-art Could you point me to your existing eBPF program?

edwarnicke commented 1 year ago

@glazychev-art Have you looked at bpf_clone_redirect() ?

glazychev-art commented 1 year ago

@edwarnicke Currently the eBPF program looks like this:

/*
 * SPDX-License-Identifier: GPL-2.0 OR Apache-2.0
 * Dual-licensed under GPL version 2.0 or Apache License version 2.0
 * Copyright (c) 2020 Cisco and/or its affiliates.
 */
#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>

/*
 * when compiled, debug print can be viewed with eg.
 * sudo cat /sys/kernel/debug/tracing/trace_pipe
 */
#ifdef DEBUG
#define s__(n)   # n
#define s_(n)    s__(n)
#define x_(fmt)  __FILE__ ":" s_(__LINE__) ": " fmt "\n"
#define DEBUG_PRINT_(fmt, ...) do { \
    const char fmt__[] = fmt; \
    bpf_trace_printk(fmt__, sizeof(fmt), ## __VA_ARGS__); } while(0)
#define DEBUG_PRINT(fmt, ...)   DEBUG_PRINT_ (x_(fmt), ## __VA_ARGS__)
#else   /* DEBUG */
#define DEBUG_PRINT(fmt, ...)
#endif  /* DEBUG */

#define ntohs(x)        __constant_ntohs(x)
#define MAX_NR_PORTS 65536

/* UDP destination ports whose traffic should be redirected to the AF_XDP socket
 * (VXLAN, Wireguard, ...); pinned by name so it can be filled from user space. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, MAX_NR_PORTS);
    __type(key, int);
    __type(value, unsigned short int);
    __uint(pinning, LIBBPF_PIN_BY_NAME);
} ports_map SEC(".maps");

/* AF_XDP sockets, keyed by RX queue index. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, int);
    __type(value, int);
} xsks_map SEC(".maps");

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx) {
    const void *data = (void *)(long)ctx->data;
    const void *data_end = (void *)(long)ctx->data_end;
    int qid = ctx->rx_queue_index;

    DEBUG_PRINT("rx %ld bytes packet", (long)data_end - (long)data);

    if (data + sizeof(struct ethhdr) > data_end) {
        DEBUG_PRINT("packet too small");
        return XDP_PASS;
    }

    const struct ethhdr *eth = data;
    if (eth->h_proto != ntohs(ETH_P_IP) && eth->h_proto != ntohs(ETH_P_ARP)) {
          return XDP_PASS;
    }

    /* ARP frames are always redirected to the AF_XDP socket so that VPP can
     * resolve the MAC addresses of other forwarders. */
    if (eth->h_proto == ntohs(ETH_P_ARP)) {
        if (!bpf_map_lookup_elem(&xsks_map, &qid)) {
            DEBUG_PRINT("no socket found");
            return XDP_PASS;
        }

        DEBUG_PRINT("going to socket %d", qid);
        return bpf_redirect_map(&xsks_map, qid, 0);
    }

    if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr) > data_end) {
        DEBUG_PRINT("packet too small");
        return XDP_PASS;
    }

    /* For IPv4, redirect to VPP only UDP packets whose destination port is
     * registered in ports_map (VXLAN, Wireguard, IPSec, ...); everything else
     * is passed on to the kernel. */
    const struct iphdr *ip = (void *)(eth + 1);
    switch (ip->protocol) {
      case IPPROTO_UDP: {
        const struct udphdr *udp = (void *)(ip + 1);
        const int port = ntohs(udp->dest);
        if (!bpf_map_lookup_elem(&ports_map, &port)) {
            DEBUG_PRINT("unsupported udp dst port %x", (int)udp->dest);
            return XDP_PASS;
        }
        break;
      }
      default:
        DEBUG_PRINT("unsupported ip proto %x", (int)ip->protocol);
        return XDP_PASS;
    }

    if (!bpf_map_lookup_elem(&xsks_map, &qid)) {
        DEBUG_PRINT("no socket found");
        return XDP_PASS;
    }

    DEBUG_PRINT("going to socket %d", qid);
    return bpf_redirect_map(&xsks_map, qid, 0);
}

/* actually Dual GPLv2/Apache2, but GPLv2 as far as kernel is concerned */
SEC("license")
char _license[] = "GPL";

In short, we pass all ARP packets to VPP and filter IP packets - if the UDP destination port belongs to VxLAN, Wireguard and so on, we pass the packet to VPP; otherwise it goes to the kernel.
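For reference, here is a rough Go sketch of how the pinned ports_map might be filled from the forwarder side using the cilium/ebpf package; the pin path and the port list are illustrative assumptions, not the project's actual values:

package uplink

import (
	"github.com/cilium/ebpf"
)

// registerTunnelPorts marks UDP destination ports whose traffic must reach VPP.
func registerTunnelPorts() error {
	// LIBBPF_PIN_BY_NAME pins the map under the bpffs mount, typically /sys/fs/bpf.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/ports_map", nil)
	if err != nil {
		return err
	}
	defer m.Close()
	// Example ports: VXLAN (4789), Wireguard (51820), IPSec NAT-T (4500).
	for _, port := range []int32{4789, 51820, 4500} {
		// The map's key type is a C int and its value type is an unsigned short,
		// so int32 / uint16 are used on the Go side.
		if err := m.Put(port, uint16(1)); err != nil {
			return err
		}
	}
	return nil
}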

glazychev-art commented 1 year ago

@edwarnicke Yes, I looked at long bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags). But as you can see, it receives an sk_buff. So it seems we can only call this function after the XDP layer, once we have already chosen the kernel path (in the TC ingress layer, for example). Otherwise we would probably need to create an sk_buff manually in the XDP function and call bpf_clone_redirect.

edwarnicke commented 1 year ago

@glazychev-art Trying to create an sk_buff sounds like it might be prone to error.

We may also want to think through what the problem really is. Is the problem that we are not receiving arp packets, or is the problem how we construct our neighbor table in VPP?

glazychev-art commented 1 year ago

I think the problem is that we are not receiving ARP packets. We construct the VPP neighbor table correctly - we take all the ARP entries known to the kernel at start time. Later, VPP needs to learn about other pods - for example, about another forwarder in order to set up a tunnel. On the other hand, we also need to process ARP in the kernel - for example, when passing a request forwarder --> manager.

glazychev-art commented 1 year ago

@edwarnicke One more question: should we consider updating VPP as a separate issue? Or does it make sense to do this only together with af_xdp?

edwarnicke commented 1 year ago

@glazychev-art Let's update VPP as a separate issue. We should do that even if we don't get AF_XDP going.

edwarnicke commented 1 year ago

@glazychev-art Would it make sense to use NeighSubscribeAt and IPNeighborAddDel to remove the need for VPP to receive ARP packets?
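For context, a minimal Go sketch of that idea (illustrative only: NeighSubscribe is used instead of NeighSubscribeAt for brevity, and vppNeighborAddDel is a hypothetical placeholder for the govpp ip_neighbor.IPNeighborAddDel call):

package uplink

import (
	"log"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// vppNeighborAddDel is a placeholder for the actual govpp binapi call that
// would add or delete the neighbor on the uplink sw_if_index.
func vppNeighborAddDel(n netlink.Neigh, isAdd bool) error {
	return nil
}

// syncNeighbors mirrors kernel neighbor table changes for one link into VPP.
func syncNeighbors(linkIndex int) error {
	updates := make(chan netlink.NeighUpdate, 64)
	done := make(chan struct{})
	// Subscribe to kernel neighbor (ARP / NDP) table changes.
	if err := netlink.NeighSubscribe(updates, done); err != nil {
		return err
	}
	go func() {
		for u := range updates {
			if u.LinkIndex != linkIndex || u.HardwareAddr == nil {
				continue
			}
			switch u.Type {
			case unix.RTM_NEWNEIGH:
				if u.State&(netlink.NUD_REACHABLE|netlink.NUD_PERMANENT) != 0 {
					if err := vppNeighborAddDel(u.Neigh, true); err != nil {
						log.Printf("add neighbor: %v", err)
					}
				}
			case unix.RTM_DELNEIGH:
				if err := vppNeighborAddDel(u.Neigh, false); err != nil {
					log.Printf("del neighbor: %v", err)
				}
			}
		}
	}()
	return nil
}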

glazychev-art commented 1 year ago

@edwarnicke Not really. We need to fully service ARP requests from both the Linux side and the VPP side. With NeighSubscribeAt and IPNeighborAddDel we would redirect all ARPs to the kernel, right?

For example, consider that Kernel and VPP MAC addresses are the same.

  1. Our server receives a broadcast ARP request - in this case it doesn't matter who answers - we have the same MAC and IP addresses.
  2. Our server sends an ARP request: a. the kernel side sends an ARP request (for communication between pods); b. the VPP side sends an ARP request (for communication between forwarders when creating a tunnel)

So, the kernel will only accept and remember a response if it sent the request itself. If the request was sent from the VPP side, the kernel will skip the response, and the forwarder won't get anything from NeighSubscribeAt.

edwarnicke commented 1 year ago

So, the kernel will only accept and remember a response if it sent the request itself.

Have we checked this? It might be true, but I wouldn't simply presume it.

glazychev-art commented 1 year ago

I think I tested something similar - without NeighSubscribeAt, but by looking at ip neigh.

But definitely, we need to double-check that.

glazychev-art commented 1 year ago

@edwarnicke It looks like NeighSubscribeAt and IPNeighborAddDel work fine for IPv4 interfaces.

But this is not the case for IPv6. Since IPv6 has its own neighbor discovery mechanism, the Linux side doesn't save the NA (Neighbor Advertisement) if the NS (Neighbor Solicitation) was sent from the VPP side. I tried changing the Solicited and Override flags in the response, but it didn't help.

Should we continue to work in this direction or does it make sense to implement only IPv4?

glazychev-art commented 1 year ago

I've tried resolving IPv6 neighbors in kernel space manually. And it works: the forwarder receives the event from netlink and adds the neighbor to VPP via IPNeighborAddDel. Ping works after that.
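For reference, one simple way to trigger kernel-side resolution for a specific IPv6 neighbor from Go (an illustrative approach, not necessarily what was done here) is to send a throwaway datagram, so the kernel itself emits the Neighbor Solicitation, accepts the Advertisement, and the entry then shows up via netlink:

package uplink

import (
	"net"
	"time"
)

// triggerNDP sends a single throwaway UDP datagram to dst (e.g. "fe80::1%eth0"
// or "2001:db8::1") so that the kernel performs Neighbor Discovery for it.
// The datagram goes to the discard port; whether it is answered doesn't matter.
func triggerNDP(dst string) error {
	conn, err := net.DialTimeout("udp6", net.JoinHostPort(dst, "9"), time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte{0})
	return err
}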

edwarnicke commented 1 year ago

Are we typically looking for anything other than the MAC address of the gateway IP for the IPv6 case?

If so, could we simply scrape the Linux neighbor table for v6?

edwarnicke commented 1 year ago

This may also help:

https://insights.sei.cmu.edu/blog/ping-sweeping-in-ipv6/

glazychev-art commented 1 year ago

Current state:

Instead, we can resolve gateways for a given interface in a slightly different way. Before creating the AF_XDP interface, we can use netlink.RouteList and then ping every gateway found. This will allow us to add neighbor entries to Linux, and they will later be read and added to VPP.
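A rough sketch of that approach (assuming the vishvananda/netlink package and a plain ping binary in the forwarder image; names are illustrative):

package uplink

import (
	"net"
	"os/exec"

	"github.com/vishvananda/netlink"
)

// resolveGateways pings every gateway reachable via the named interface so the
// kernel populates its neighbor table before the interface is moved to AF_XDP.
func resolveGateways(ifName string) ([]net.IP, error) {
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return nil, err
	}
	routes, err := netlink.RouteList(link, netlink.FAMILY_ALL)
	if err != nil {
		return nil, err
	}
	var gws []net.IP
	for _, r := range routes {
		if r.Gw == nil {
			continue
		}
		// One echo request is enough to trigger ARP/NDP resolution; we don't
		// care whether the ping itself succeeds.
		_ = exec.Command("ping", "-c", "1", "-W", "1", r.Gw.String()).Run()
		gws = append(gws, r.Gw)
	}
	return gws, nil
}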

@edwarnicke What do you think?

glazychev-art commented 1 year ago

@edwarnicke It seems that it is not possible to run more than one AF_XDP forwarder on one node, unlike AF_PACKET (forwarders use hostNetwork). Logs from the second one:

af_xdp               [error ]: af_xdp_create_queue: xsk_socket__create() failed (is linux netdev vpp1host up?): Device or resource busy
create interface af_xdp: xsk_socket__create() failed (is linux netdev vpp1host up?): Device or resource busy

glazychev-art commented 1 year ago

Current state: Tested a new forwarder on public clusters: GKE - doesn't start. Logs from forwarder:

Apr  3 05:38:16.954 [INFO] [cmd:vpp] libbpf: Kernel error message: virtio_net: XDP expects header/data in single page, any_header_sg required
Apr  3 05:38:16.954 [INFO] [cmd:vpp] vpp[10244]: af_xdp: af_xdp_load_program: bpf_set_link_xdp_fd(eth0) failed: Invalid argument
Apr  3 05:38:18.228 [ERRO] [cmd:/bin/forwarder] [duration:12.809608ms] [hostIfName:eth0] [vppapi:AfXdpCreate] VPPApiError: System call error #6 (-16)
panic: error: VPPApiError: System call error #6 (-16)

AWS - doesn't start. Logs from forwarder:

Apr  3 13:24:25.406 [INFO] [cmd:vpp] libbpf: Kernel error message: veth: Peer MTU is too large to set XDP
Apr  3 13:24:25.406 [INFO] [cmd:vpp] vpp[10508]: af_xdp: af_xdp_load_program: bpf_set_link_xdp_fd(eth0) failed: Numerical result out of range
Apr  3 13:24:26.563 [ERRO] [cmd:/bin/forwarder] [duration:18.015838ms] [hostIfName:eth0] [vppapi:AfXdpCreate] VPPApiError: System call error #6 (-16)
panic: error: VPPApiError: System call error #6 (-16)

Packet - started, but ping doesn't work. This is most likely due to the fact that the af_packet vpp plugin doesn't process bonded interfaces (which Packet uses).

AKS - ping works only without the hostNetwork: true flag, and performance is poor (about 2 times slower than AF_PACKET).

Kind - works, but performance has not increased (it even decreased slightly).

Measurements on Kind

iperf3 TCP

Ethernet remote mechanism (VxLAN)

_AF_PACKET:_

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 43488 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  46.6 MBytes   391 Mbits/sec  174    969 KBytes       
[  5]   1.00-2.00   sec  48.8 MBytes   409 Mbits/sec    0   1.02 MBytes       
[  5]   2.00-3.00   sec  58.8 MBytes   493 Mbits/sec    0   1.07 MBytes       
[  5]   3.00-4.00   sec  53.8 MBytes   451 Mbits/sec    0   1.10 MBytes       
[  5]   4.00-5.00   sec  46.2 MBytes   388 Mbits/sec    0   1.12 MBytes       
[  5]   5.00-6.00   sec  62.5 MBytes   524 Mbits/sec    0   1.13 MBytes       
[  5]   6.00-7.00   sec  45.0 MBytes   377 Mbits/sec    0   1.14 MBytes       
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec    0   1.18 MBytes       
[  5]   8.00-9.00   sec  56.2 MBytes   472 Mbits/sec    0   1.22 MBytes       
[  5]   9.00-10.00  sec  45.0 MBytes   377 Mbits/sec    0   1.24 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   528 MBytes   443 Mbits/sec  174             sender
[  5]   0.00-10.08  sec   526 MBytes   438 Mbits/sec                  receiver

_AF_XDP:_

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 36586 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  46.9 MBytes   393 Mbits/sec  1326    113 KBytes       
[  5]   1.00-2.00   sec  41.3 MBytes   346 Mbits/sec  1114   42.2 KBytes       
[  5]   2.00-3.00   sec  36.2 MBytes   304 Mbits/sec  1058   34.0 KBytes       
[  5]   3.00-4.00   sec  54.2 MBytes   455 Mbits/sec  1560   20.4 KBytes       
[  5]   4.00-5.00   sec  36.3 MBytes   304 Mbits/sec  1149   44.9 KBytes       
[  5]   5.00-6.00   sec  27.9 MBytes   234 Mbits/sec  953   20.4 KBytes       
[  5]   6.00-7.00   sec  37.9 MBytes   318 Mbits/sec  1106   25.9 KBytes       
[  5]   7.00-8.00   sec  33.1 MBytes   278 Mbits/sec  964   25.9 KBytes       
[  5]   8.00-9.00   sec  39.2 MBytes   329 Mbits/sec  1448   32.7 KBytes       
[  5]   9.00-10.00  sec  51.1 MBytes   429 Mbits/sec  1445   23.1 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   404 MBytes   339 Mbits/sec  12123             sender
[  5]   0.00-10.00  sec   403 MBytes   338 Mbits/sec                  receiver

Note: many Retrs

IP remote mechanism (Wireguard)

_AF_PACKET:_

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 49978 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  88.3 MBytes   740 Mbits/sec    2    487 KBytes       
[  5]   1.00-2.00   sec  87.4 MBytes   733 Mbits/sec    0    606 KBytes       
[  5]   2.00-3.00   sec  76.5 MBytes   642 Mbits/sec    6    495 KBytes       
[  5]   3.00-4.00   sec  74.6 MBytes   626 Mbits/sec    0    596 KBytes       
[  5]   4.00-5.00   sec  42.3 MBytes   355 Mbits/sec    0    649 KBytes       
[  5]   5.00-6.00   sec  21.7 MBytes   182 Mbits/sec    8    473 KBytes       
[  5]   6.00-7.00   sec  36.9 MBytes   310 Mbits/sec    0    545 KBytes       
[  5]   7.00-8.00   sec  88.9 MBytes   746 Mbits/sec    0    636 KBytes       
[  5]   8.00-9.00   sec  82.4 MBytes   691 Mbits/sec    8    539 KBytes       
[  5]   9.00-10.00  sec  92.0 MBytes   772 Mbits/sec    0    664 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   691 MBytes   580 Mbits/sec   24             sender
[  5]   0.00-10.03  sec   690 MBytes   577 Mbits/sec                  receiver

_AF_XDP:_

Connecting to host 172.16.1.100, port 5201
[  5] local 172.16.1.101 port 46608 connected to 172.16.1.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   104 MBytes   873 Mbits/sec   47    645 KBytes       
[  5]   1.00-2.00   sec  98.7 MBytes   828 Mbits/sec   39    538 KBytes       
[  5]   2.00-3.00   sec  90.9 MBytes   763 Mbits/sec    0    655 KBytes       
[  5]   3.00-4.00   sec  65.2 MBytes   547 Mbits/sec   14    533 KBytes       
[  5]   4.00-5.00   sec  53.3 MBytes   447 Mbits/sec    7    603 KBytes       
[  5]   5.00-6.00   sec  52.4 MBytes   440 Mbits/sec    0    660 KBytes       
[  5]   6.00-7.00   sec  39.1 MBytes   328 Mbits/sec    8    526 KBytes       
[  5]   7.00-8.00   sec  38.7 MBytes   325 Mbits/sec    0    587 KBytes       
[  5]   8.00-9.00   sec  94.8 MBytes   796 Mbits/sec    0    675 KBytes       
[  5]   9.00-10.00  sec  96.0 MBytes   805 Mbits/sec    7    618 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   733 MBytes   615 Mbits/sec  122             sender
[  5]   0.00-10.05  sec   732 MBytes   611 Mbits/sec                  receiver

iperf3 UDP

_AF_PACKET:_

Accepted connection from 172.16.1.101, port 39452
[  5] local 172.16.1.100 port 5201 connected to 172.16.1.101 port 40692
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   118 MBytes   986 Mbits/sec  0.077 ms  525084/613923 (86%)  
[  5]   1.00-2.00   sec   117 MBytes   980 Mbits/sec  0.002 ms  576553/664766 (87%)  
[  5]   2.00-3.00   sec   120 MBytes  1.01 Gbits/sec  0.050 ms  576732/667716 (86%)  
[  5]   3.00-4.00   sec   120 MBytes  1.00 Gbits/sec  0.002 ms  581367/671794 (87%)  
[  5]   4.00-5.00   sec   120 MBytes  1.00 Gbits/sec  0.002 ms  612951/703307 (87%)  
[  5]   5.00-6.00   sec   122 MBytes  1.03 Gbits/sec  0.001 ms  535717/628083 (85%)  
[  5]   6.00-7.00   sec   117 MBytes   980 Mbits/sec  0.041 ms  578869/667122 (87%)  
[  5]   7.00-8.00   sec   119 MBytes  1.00 Gbits/sec  0.002 ms  577990/668247 (86%)  
[  5]   8.00-9.00   sec   116 MBytes   974 Mbits/sec  0.002 ms  582754/670426 (87%)  
[  5]   9.00-10.00  sec   120 MBytes  1.01 Gbits/sec  0.024 ms  579465/670305 (86%)  
[  5]  10.00-10.21  sec  2.50 MBytes   100 Mbits/sec  0.002 ms  38604/40489 (95%)  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.21  sec  1.16 GBytes   979 Mbits/sec  0.002 ms  5766086/6666178 (86%)  receiver

_AF_XDP:_

[  5] local 172.16.1.100 port 5201 connected to 172.16.1.101 port 41437
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-1.00   sec   156 MBytes  1.31 Gbits/sec  0.001 ms  491872/609832 (81%)  
[  5]   1.00-2.00   sec   168 MBytes  1.41 Gbits/sec  0.001 ms  557337/684419 (81%)  
[  5]   2.00-3.00   sec   166 MBytes  1.39 Gbits/sec  0.001 ms  551925/677423 (81%)  
[  5]   3.00-4.00   sec   163 MBytes  1.36 Gbits/sec  0.001 ms  557553/680349 (82%)  
[  5]   4.00-5.00   sec   165 MBytes  1.38 Gbits/sec  0.001 ms  553140/677503 (82%)  
[  5]   5.00-6.00   sec   170 MBytes  1.43 Gbits/sec  0.002 ms  558848/687616 (81%)  
[  5]   6.00-7.00   sec   161 MBytes  1.35 Gbits/sec  0.001 ms  558833/680687 (82%)  
[  5]   7.00-8.00   sec   162 MBytes  1.36 Gbits/sec  0.001 ms  575608/698261 (82%)  
[  5]   8.00-9.00   sec   163 MBytes  1.36 Gbits/sec  0.001 ms  550618/673519 (82%)  
[  5]   9.00-10.00  sec   169 MBytes  1.42 Gbits/sec  0.001 ms  555133/683148 (81%)  
[  5]  10.00-11.00  sec   434 KBytes  3.55 Mbits/sec  3.840 ms  0/320 (0%)  
[  5]  11.00-12.00  sec  43.4 KBytes   355 Kbits/sec  7.520 ms  0/32 (0%)

Conclusions

Client sends UDP: AF_XDP is faster than AF_PACKET by ~40% (1.37 Gbits/sec vs 0.98 Gbits/sec)

Client sends TCP (average of 10 runs):
Ethernet (VxLAN): AF_PACKET is faster than AF_XDP by ~13% (460.3 Mbits/sec vs 407.2 Mbits/sec)
IP (Wireguard): AF_XDP is roughly equal to AF_PACKET (372.1 Mbits/sec vs 370.2 Mbits/sec)

glazychev-art commented 1 year ago

Estimation

To run CI on a kind cluster with AF_XDP we need:

  1. Prepare a PR for sdk-vpp ~ 1h
  2. Prepare a PR for cmd-forwarder-vpp ~ 1h
  3. Add a new afxdp suite to deployments-k8s ~ 2h
  4. Add and test the suite on kind ~ 2h
  5. Risks ~ 2h

glazychev-art commented 1 year ago

@edwarnicke Due to the problems with public clusters (see the beginning of the post), there is an option to support af_xdp only on kind in this release. What do you think of it?

edwarnicke commented 1 year ago

@glazychev-art It's strange that AF_PACKET is faster for TCP but slower for UDP. Do we have any notion of why?

glazychev-art commented 1 year ago

@edwarnicke Yes, there are a couple of guesses:

  1. If we look at the iperf3 logs from TCP mode, we will see a huge number of retransmissions:
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  46.9 MBytes   393 Mbits/sec  1326    113 KBytes       
    [  5]   1.00-2.00   sec  41.3 MBytes   346 Mbits/sec  1114   42.2 KBytes
    ...

    (we don't see them with AF_PACKET)

  2. I was able to reproduce something similar on bare vpp instances: https://lists.fd.io/g/vpp-dev/topic/af_xdp_performance/98105671
  3. If we look at the vpp gerrit, we can see several open af_xdp patches whose owners claim they will greatly improve performance (I tried them; they didn't help for TCP). https://gerrit.fd.io/r/c/vpp/+/37653 https://gerrit.fd.io/r/c/vpp/+/38135

So, I think the problem may be in the VPP plugin.

glazychev-art commented 1 year ago

As part of this task, we have integrated the AF_XDP interface on the kind cluster. It is working successfully. https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/4798046461/jobs/8535800517

On public clusters we ran into problems. Separate issues were created: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/859. Performance: https://github.com/networkservicemesh/cmd-forwarder-vpp/issues/860

I think this issue can be closed