luigirizzo / netmap

Automatically exported from code.google.com/p/netmap
BSD 2-Clause "Simplified" License

VM TCP_STREAM sending slow for ptnet cross hypervisors #387

Closed snowcherry closed 7 years ago

snowcherry commented 7 years ago

I tested two VMs launched on the same hypervisor, connected through VALE, and the speed is just fine: both scp sending and receiving rates are above 100 MB/s. However, if the two VMs are launched on different hypervisors connected through VALE + eth0, the sending rate drops to only 3 MB/s and keeps getting slower, while the receiving rate is still fine, as I verified by sending from the other hypervisor directly. One thing I noticed: if I turn tso, gso and gro on on both sides, the cross-hypervisor sending rate improves a lot, but the single-hypervisor rate decreases a lot. Below are the iperf sending and receiving rates in a VM with ptnet; the other peer is the other hypervisor's eth0.

Sending with tso, gso, gro on:
iperf -c 172.10.10.5 -P 4 -i 3 -t 60         
------------------------------------------------------------
Client connecting to 172.10.10.5, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  6] local 172.10.10.20 port 45096 connected with 172.10.10.5 port 5001
[  3] local 172.10.10.20 port 45090 connected with 172.10.10.5 port 5001
[  4] local 172.10.10.20 port 45092 connected with 172.10.10.5 port 5001
[  5] local 172.10.10.20 port 45094 connected with 172.10.10.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0- 3.0 sec   896 KBytes  2.45 Mbits/sec
[  3]  0.0- 3.0 sec   896 KBytes  2.45 Mbits/sec
[  5]  0.0- 3.0 sec   896 KBytes  2.45 Mbits/sec
[  6]  0.0- 3.0 sec   640 KBytes  1.75 Mbits/sec
[SUM]  0.0- 3.0 sec  3.25 MBytes  9.09 Mbits/sec

Sending with tso, gso, gro off:
iperf -c 172.10.10.5 -P 8
------------------------------------------------------------
Client connecting to 172.10.10.5, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  4] local 172.10.10.20 port 49664 connected with 172.10.10.5 port 5001
[  3] local 172.10.10.20 port 49662 connected with 172.10.10.5 port 5001
[  5] local 172.10.10.20 port 49666 connected with 172.10.10.5 port 5001
[  9] local 172.10.10.20 port 49674 connected with 172.10.10.5 port 5001
[  7] local 172.10.10.20 port 49670 connected with 172.10.10.5 port 5001
[  8] local 172.10.10.20 port 49672 connected with 172.10.10.5 port 5001
[ 10] local 172.10.10.20 port 49676 connected with 172.10.10.5 port 5001
[  6] local 172.10.10.20 port 49668 connected with 172.10.10.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 9.0 sec   103 MBytes  96.1 Mbits/sec
[  9]  0.0- 9.0 sec   119 MBytes   111 Mbits/sec
[  7]  0.0- 9.0 sec  55.2 MBytes  51.4 Mbits/sec
[  5]  0.0- 9.0 sec   109 MBytes   101 Mbits/sec
[  6]  0.0- 9.0 sec   119 MBytes   110 Mbits/sec
[ 10]  0.0- 9.1 sec   138 MBytes   127 Mbits/sec
[  8]  0.0- 9.1 sec   277 MBytes   255 Mbits/sec
[  4]  0.0-10.4 sec   188 MBytes   152 Mbits/sec
[SUM]  0.0-10.4 sec  1.08 GBytes   895 Mbits/sec

Receiving:
iperf -s         
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 172.10.10.20 port 5001 connected with 172.10.10.5 port 36940
[  5] local 172.10.10.20 port 5001 connected with 172.10.10.5 port 36942
[  6] local 172.10.10.20 port 5001 connected with 172.10.10.5 port 36944
[  7] local 172.10.10.20 port 5001 connected with 172.10.10.5 port 36946
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-21.0 sec  5.71 GBytes  2.33 Gbits/sec
[  5]  0.0-21.0 sec  5.00 GBytes  2.04 Gbits/sec
[  6]  0.0-21.0 sec  5.40 GBytes  2.21 Gbits/sec
[  7]  0.0-21.0 sec  5.73 GBytes  2.34 Gbits/sec
[SUM]  0.0-21.0 sec  21.8 GBytes  8.92 Gbits/sec
snowcherry commented 7 years ago

All physical NICs and virtual NICs are in netmap mode. The pkt-gen rates for the different peers are:

Hyper1 eth0 <--> Hyper1 eth0: both sending and receiving at 14 Mpps
Hyper1 eth0 --> Hyper2 vale1:eth0 --> VM vale1:10: sending 14 Mpps, receiving 8.7 Mpps
VM vale1:10 --> Hyper2 vale1:eth0 --> Hyper1 eth0: sending 21.7 Mpps, receiving 42.5 Kpps (??????)
Hyper1 vale1:20 <--> Hyper1 vale1:eth0 <--> Hyper2 vale1:eth0 <--> Hyper2 vale1:20: sending 15 Mpps, receiving 7.7 Mpps

vmaffione commented 7 years ago

I assume you are using this QEMU https://github.com/vmaffione/qemu (configured with netmap and ptnetmap support) and the latest netmap master (configured with ptnetmap support). The short story is: netmap physical ports don't support offloadings (TSO and TX checksum offloading), while VALE ports do. If a TSO packet coming from a VALE port goes to a physical port, TSO and checksumming are performed in software by VALE (which is slow, but not as slow as what you report). It is possible to disable TSO/checksum offloading for a VM by loading netmap (inside the VM) this way:

  # modprobe netmap ptnet_vnet_hdr=0

If your VMs are all on the same hypervisor, you should use ptnet_vnet_hdr=1 and get 10-20 Gbps VM-to-VM TCP throughput (with netperf or similar). If your VMs are on different hypervisors, connected through the VALE switch as you say, for TCP workloads it may be better to use ptnet_vnet_hdr=0, so that you don't incur the slow path (offloadings performed in software, see above). If you want to do middlebox packet processing, that is, if your VMs are not the endpoints of a substantial TCP workload (e.g. they run pkt-gen, bridge, or real NFV applications), you should use ptnet_vnet_hdr=0, because you wouldn't use the offloadings anyway.
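As a quick sanity check of which mode a guest ends up in, the guest's offload state and the module parameter can be inspected (a hedged sketch; the guest interface name eth0 and the sysfs path are assumptions that may vary with the netmap build):

```shell
# Inside the VM: with ptnet_vnet_hdr=0 the guest stack does its own
# segmentation and checksumming, so these features should report "off".
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|checksumming'

# If the module exports the parameter via sysfs, its current value
# can be read back after loading netmap:
cat /sys/module/netmap/parameters/ptnet_vnet_hdr
```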

This said:

vmaffione commented 7 years ago

For completeness, this may be useful https://github.com/vmaffione/netmap-tutorial/blob/master/virtualization.pdf

snowcherry commented 7 years ago

Thanks for your reply, I have read that paper carefully a few times. I just tested with ptnet_vnet_hdr=0: the sending rate with iperf can reach 5.7 Gbit/s, which is a good improvement before any further optimization. Also the pkt-gen rates are much more reasonable, sending 15 Mpps and receiving 8 Mpps, unlike the result in my second comment with the question marks (VM vale1:10 --> Hyper2 vale1:eth0 --> Hyper1 eth0: sending 21.7 Mpps, receiving 42.5 Kpps). The scenario I want to use this in is a cloud, where I don't know which hypervisors the VMs will end up on, so for now ptnet_vnet_hdr=0 is acceptable.

The QEMU I use is https://github.com/vmaffione/qemu, branch ptnet; the netmap version is v11.3 for both the hypervisor and the VM. (P.S. a VM with the master netmap.ko throws a null pointer when the link comes up; I can open another issue for that.) The testing scenario is Hyper1 eth0 (172.10.10.5) <--> Hyper2 vale1:eth0 <--> VM vale1:10 (172.10.10.20)

Does ptnet_vnet_hdr=0 mean the frame doesn't carry the additional header, so netmap can treat it like a normal Ethernet frame when it passes through a VALE port? Does it matter whether or not I disable offloads on eth0?

snowcherry commented 7 years ago

Another thing: with ptnet_vnet_hdr=0, the sending rate is greatly increased, but the CPU usage is increased a lot as well, which is against expectations, because sending shouldn't use that much CPU. Both sending and receiving get about 5.7 Gbps with gro off.

Sending CPU usage:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13290 root      20   0 5098416 717660  14008 S 300.0  0.5  20:34.15 qemu-system-x86                                                                                                                 
13394 root      20   0       0      0      0 D  57.5  0.0   2:32.60 nmkth:13295:0                                                                                                                   
13395 root      20   0       0      0      0 S  20.6  0.0   0:59.56 nmkth:13295:1   

Receiving cpu usage:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13290 root      20   0 5098416 717672  14008 S 249.5  0.5  25:41.33 qemu-system-x86                                                                                                                 
13394 root      20   0       0      0      0 D  34.6  0.0   3:24.60 nmkth:13295:0                                                                                                                   
13395 root      20   0       0      0      0 S  28.2  0.0   1:31.15 nmkth:13295:1 

With gro on, the receive rate can reach 9.3 Gbps while the sending rate remains the same. Also, looking at the nmkth kernel threads, their CPU affinity is always 0, and I cannot change it with taskset. That means all ports share CPU 0, so as the number of ports increases, the performance is likely to drop? Can you help with this? Ptnet is a wonderful idea, and I'm wondering how to use it in our production.

vmaffione commented 7 years ago

Keep in mind that we don't maintain released versions, so if you use an old version you may miss important fixes and improvements. That said, I'm currently using ptnetmap with the netmap master branch, and I don't see any crashes. What is crashing, the VM or the host? Could you please open an issue for that, with all the details?

ptnet_vnet_hdr==0 means that the ptnet driver in the VM won't prepend the standard virtio-net header (http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html, section 5.1.6) to each Ethernet packet to be transmitted, and won't expect the header to be there on received packets. The header is used mainly for TSO offloading. The point is that if your VMs are not the endpoints of TCP connections that carry your workload (of course you will have ssh control connections, but those are a few KB/s and are not your workload), you don't really need TSO. This is very common if your VMs are doing middlebox processing (e.g. any NFV node). So yes, ptnet_vnet_hdr==0 means no additional headers, and no offloading is performed (neither in software nor in hardware). The packets are simply copied between VALE ports, without hitting the "software offloading slow path", because both vale1:eth0 and vale1:10 agree on the packet format. With ptnet_vnet_hdr!=0 you would have a packet format mismatch between vale1:eth0 (no header) and vale1:10 (header), so the slow path is invoked to cope with it. As stated in LINUX/README, you should always disable all the offloadings on physical netmap ports, because netmap does not support them.
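Following the LINUX/README advice quoted above, a minimal host-side setup could look like this (a sketch using the interface and switch names from this thread; the exact ethtool feature list supported varies by NIC):

```shell
# On the hypervisor, before attaching the physical NIC to the VALE switch:
# netmap physical ports do not support offloadings, so turn them all off.
ethtool -K eth0 tx off rx off tso off gso off gro off lro off

# Promiscuous mode, so the port also accepts frames addressed to the VMs.
ip link set eth0 promisc on

# Attach the NIC to the VALE switch (as done elsewhere in this thread).
vale-ctl -a vale1:eth0
```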

Using gro in the VM is ok (and often beneficial), because it is an optimization that involves only the VM network stack, and so it's orthogonal to netmap. I would suggest reading this paper http://info.iet.unipi.it/~luigi/papers/20160613-ptnet.pdf to understand the use cases for which it is better to use ptnetmap and those for which it is better to use virtio-net+vhost-net.
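Since GRO only touches the guest's own stack, re-enabling it inside the VM is safe regardless of the ptnet_vnet_hdr setting (a sketch; eth0 is assumed to be the guest's ptnet interface name):

```shell
# Inside the guest: GRO coalesces received segments in the VM's own
# network stack and does not interact with netmap's packet format.
ethtool -K eth0 gro on
```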

Another thing you may try is to load netmap in the host with the ptnetmap_tx_workers parameter set to 0. This basically removes the netmap host TX worker, which means less CPU utilization, usually with no impact on pkt-gen-like performance.
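Concretely, that would look like the following on the host (a sketch; any netmap users, such as attached VALE ports or running VMs, must be stopped before the module can be unloaded):

```shell
# On the host: reload netmap without the host TX kernel thread,
# trading the dedicated worker for lower CPU utilization.
rmmod netmap
modprobe netmap ptnetmap_tx_workers=0
```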

Regarding the affinity issue, let me check it out. It must of course be possible to let the kthreads use different cores.

vmaffione commented 7 years ago

The affinity issue was a bug that slipped in somehow. It has been fixed by e4bb02b68c4d0cbd716e17075558e3cfe292ff44. Now you should be able to set the CPU affinity for the netmap kthreads as you wish. Could you please retry with the current master?
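With that fix in place, pinning the ptnetmap kernel threads should work like for any other task (a sketch; the PIDs below are the nmkth:13295:0/1 threads from the top output earlier in this thread, and the target CPUs are arbitrary):

```shell
# Pin the TX and RX kernel threads to dedicated cores, so that
# multiple ports no longer pile up on CPU 0.
taskset -pc 2 13394   # nmkth:13295:0 -> CPU 2
taskset -pc 3 13395   # nmkth:13295:1 -> CPU 3
```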

snowcherry commented 7 years ago

I have tried the latest master; the null pointer issue isn't there anymore, though I did hit the link-up issue two weeks ago. Anyway, not important now. The only issue is that some error messages show up during make install, and the installed ixgbe.ko doesn't seem to be the right one with the netmap patch. After I ran make again, manually copied ixgbe.ko, ran depmod, etc., it worked. I know it sounds odd, but I'm just reporting it here.

I read the paper you mentioned and learned that for normal applications it might be better to just choose virtio+vhost. Before reading it, I had the wrong impression that ptnet does fewer memory copies, but in fact it's the contrary: the vhost kernel thread saves the copy and avoids the kernel/user space switches.

For the CPU usage when sending and receiving with ptnet_vnet_hdr==0, I think the main reason is that the VM OS has to compute all checksums and do the segmentation by itself. I also tested ptnet_vnet_hdr=1 with the latest master: the speed is only 5-9 Mbps, and the worse part is that both the VM and the hypervisor froze while I was running the perf test; this happened every time with tso and gso enabled.

Another thing that confuses me: doesn't bridge + virtio driver need to handle the same offloading issue? If a VM tap interface and a physical NIC are bridged together, assuming the driver knows to remove the vnet header, the frames can use hardware offload. But as far as I remember, when it was bridged with a vxlan interface, the speed didn't drop significantly, and there was no need to touch the vNIC's parameters inside the VMs. Where is the offload supposed to happen?

vmaffione commented 7 years ago

On 2017-11-03 14:29 GMT+01:00, snowcherry notifications@github.com wrote:

> I have tried the latest master, the null pointer issue isn't there anymore, though I did met the link up issue two weeks ago. Anyway, not important now. The only issue is some error messages show up when make install and the installed ixgbe.ko doesn't seem to be the right one with netmap patch. After I ran make again and manually copied the ixgbe.ko, and depmod etc, then it worked. I know it sounds odd but I just report it here.

The patched kernel drivers are normally installed into /lib/modules/$(uname -r)/extra. If you modprobe ixgbe, I guess the unpatched driver gets loaded, which is not what you want. That's why you should normally pass something like --driver-suffix=_netmap to ./configure, so that you can do modprobe ixgbe_netmap and be sure you are loading the right one.
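For reference, a build using that option could look like this (a sketch; the --drivers selection is an assumption, adjust it to the NICs you actually need):

```shell
# Build only the ixgbe driver, with a suffix so the patched module has
# an unambiguous name, then load it explicitly instead of the stock one.
./configure --drivers=ixgbe --driver-suffix=_netmap
make && sudo make install
sudo rmmod ixgbe            # unload the stock driver if it is loaded
sudo modprobe ixgbe_netmap  # load the netmap-patched driver
```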

> I read the paper as you mentioned and learned the point that for normal applications, it might be better to just choose virtio+vhost, before reading it, I got the wrong opinion that ptnet has less memory copy, but in fact on the contrary, vhost kernel thread saves the copy and avoids kernel and user space switch.

ptnetmap is blazingly fast when you use netmap applications: nothing that virtio-based solutions can even get close to. The possibility to use ptnetmap also with socket applications (with good performance) is a nice additional feature that makes it appealing in general; however, because of the architecture of netmap, an additional copy is needed. So in general, with socket applications, virtio+vhost has better performance. But not always: for instance, ptnetmap wins over vhost on VM-to-VM netperf TCP_RR tests. Consider that you can avoid the TX kernel thread with ptnetmap_tx_workers=0.

> For the cpu usage for sending and receiving with ptnet_vnet_hdr==0, I think the main reason is because the VM OS has to compute all checksum and segmentation by itself. I also tested ptnet_vnet_hdr=1 with latest master, the speed is only 5-9Mbps, and the worse part is both vm and hypervisor went dead when I doing perf test, this happened all the time with tso, gso enabled.

Yes, the CPU usage for socket applications is higher, as you say. But again, the point is that ptnetmap is optimal when used with netmap applications, and usually suboptimal for socket applications (which you are mostly interested in, apparently). 5-9 Mbps is so low that it is clearly a configuration problem or a bug (when the throughput is so low, it means that transmission is "triggered by some TCP timeout", which is a clear sign of systematic packet drops). Could you report the test details (where the VMs run, QEMU command line, vale-ctl commands, ...)? I guess your VMs are on different hosts, because when the VMs are on the same host one should get 10-20 Gbps.

> Another confusion for me is that isn't bridge + virtio driver needs to handle the same thing for offloading? If a VM tap interface and a physical NIC are bridged together, assume the driver knows to remove the vnet_header, then the frames can go hardware offload. But as far as I remember, when it bridged with a vxlan interface, the speed didn't drop significantly, and no need to touch vNic's params in side of VMs, where is the offload supposed to happen?

I'm not sure I got your question, but: the in-kernel Linux bridge (br0), TAPs and vxlan interfaces are standard Linux network interfaces, using the sk_buff as packet representation. Because of this, hw offloadings are fully supported and you get the maximum TCP throughput. Netmap doesn't use sk_buffs, and it doesn't support offloadings, because of its design. Netmap aims at improving the Mpps (i.e. reducing the average per-packet overhead), while the standard Linux network stack aims at improving the Gbps (TCP throughput). The use cases are different: netmap is good for NFV/middlebox/ISP applications; the Linux network stack (including virtualization components like vhost, virtio, etc.) is good for socket TCP applications.


-- Vincenzo Maffione

snowcherry commented 7 years ago

The new driver is in /lib/modules/$(uname -r)/updates/drivers/net/ethernet/intel/ixgbe. The weird thing is that the unpatched one shows the same version in ethtool -i eth0 -- 5.3.3, while the built-in one is a lower version. I will check if I can reproduce this and open another issue if so.

So in a cloud environment, I can use virtio + vhost + bridge for users' workload VMs, and netmap + ptnet for NFV function VMs like firewalls and routers. But I still have 3 questions that need your advice:

  1. About layer-4 load balancers such as haproxy: assuming there are implementations based on the kernel stack and on netmap, which type is better to choose?
  2. What I also need is vxlan. The hypervisors launching NFV VMs may need a netmap-based vxlan implementation so they can communicate with normal VMs using kernel vxlan. Since this is based on UDP, would it be better if hardware offloading could be used, or does it not matter much?
  3. Back to the normal VMs on virtio + vhost + bridge: is the performance still desirable when they are hooked to iptables on the hypervisor, compared to ptnet + userland IP filtering?

The commands I used are:

On Hyper A:
1. qemu-system-x86_64 xenial.img -enable-kvm -smp 4 -m 4G -vga std -device ptnet-pci,netdev=data10,mac=00:AA:BB:CC:0a:06 -netdev netmap,ifname=vale1:10,id=data10,passthrough=on -vnc 0.0.0.0:1,password -monitor stdio
2. vale-ctl -a vale1:eth0 # All offloads were disabled according to the README, and eth0 is in promisc mode
3. Inside the VM: iperf -c 172.10.10.5 -P 2 -i 3 -t 60

On Hyper B:
iperf -s
The receiving command is run on the physical machine directly, because this shouldn't affect the sending side.
vmaffione commented 7 years ago

1 --> I don't know. Many netmap users have proprietary applications, so we don't have benchmarks. We have an lb load balancer program (check apps/lb/), which is not sophisticated, and you can extend it. In general a netmap load balancer application is expected to be faster than a traditional kernel-stack-based one, because of batching, zerocopy, light data structures, etc. (the usual advantages of userspace networking frameworks).

2 --> In general you need to use netmap everywhere if you want to benefit from netmap. Mixing netmap and the kernel stack in your fastpath does not usually make sense, and it is hardly an optimal solution (on the contrary, it can be worse than just using the kernel stack alone). If you want to use VXLAN for your VMs, you need to implement a netmap application doing the VXLAN tunneling (at least for the fastpath/datapath) and using the NIC in netmap mode to send/receive to/from the tunnel. You need to try it to see if it's worth it over kernel+offloadings (it should be, because of batching etc.).

3 --> It clearly depends on how much iptables filtering you do. There is no way to know the answer without trying.

Thanks for the command, I'll try to reproduce your issue when I get back to office.

snowcherry commented 7 years ago

Thanks for your answers, they are very helpful. For the vxlan tunneling: if I use netmap for the fast path, is the right approach to use a netmap-patched veth pair, with one end opened by the netmap vxlan program and the other end attached to the Linux bridge? Or should I use a tap device?

vmaffione commented 7 years ago

No, that's not possible: both ends of the veth must be used in netmap mode (a current limitation, but the point is that it does not make sense to support the mixed case). Once again you are trying to mix netmap (the vxlan netmap program writing into veth1) with the kernel stack (veth2 attached to the Linux bridge). This is not a good idea, as explained above.

The right approach here is to let your netmap program read traffic from the VM (using a VALE switch, or a pipe), do the tunneling and send the traffic to the NIC. You use netmap everywhere: no Linux bridges, no taps, no veths.

snowcherry commented 7 years ago

But then there are no normal VMs. You suggested that normal VMs with TCP endpoints use virtio + vhost, so I want to find a way to combine the two. I wouldn't want to mix if ptnet had comparable performance (assuming the bug mentioned here is fixed), but in a cloud we need to support all kinds of images, and a customer image may not have the netmap module. netmap + virtio doesn't have enough performance from what I tested. Anyway, I'm going to write some code to try things out; I'm sure a tap device can do the trick, I'm just not sure about the performance. The important thing is that I don't need any single VM to achieve the best throughput; what I need is for all the VMs on the same hypervisor together to gain better throughput with less CPU usage. Will let you know how it goes; it might be a failure. Thanks for all your help anyway.

vmaffione commented 7 years ago

To make it clear, you can use virtio-net in the VM and use a VALE port as a backend (without vhost), e.g.

qemu-system-x86_64 img.qcow2 -enable-kvm -smp 2 -m 2G -vga std -device virtio-net-pci,netdev=data1,mac=00:AA:BB:CC:01:01,ioeventfd=on,mrg_rxbuf=on -netdev netmap,ifname=vale1:1,id=data1

and so VALE can be a common software switch for all your VMs.

Now your goal is clear, but you still have the constraint of not having netmap in all the images (obviously). The optimal solution for you depends a lot on what percentage of your VMs have netmap inside. If you have many VMs with netmap, you could attach all the VMs to the VALE switch: netmap VMs can use ptnet, while the others can use virtio-net. If you use ptnetmap_tx_workers=0 you also reduce the CPU usage for the ptnet VMs, because you remove the I/O thread. If you have only a few VMs with netmap, it's probably not worth using netmap/VALE at all, given your cost function and constraints.
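Putting the two QEMU command lines from this thread together, a mixed deployment on a single VALE switch could look like this (a sketch; image names, MACs and port numbers are placeholders):

```shell
# netmap-capable guest: ptnet frontend on VALE port vale1:1
qemu-system-x86_64 netmap-guest.img -enable-kvm -smp 2 -m 2G \
  -device ptnet-pci,netdev=d1,mac=00:AA:BB:CC:01:01 \
  -netdev netmap,ifname=vale1:1,id=d1

# plain guest (no netmap module inside): virtio-net frontend,
# same VALE switch as backend, so both VMs share one software switch.
qemu-system-x86_64 plain-guest.img -enable-kvm -smp 2 -m 2G \
  -device virtio-net-pci,netdev=d2,mac=00:AA:BB:CC:01:02,ioeventfd=on,mrg_rxbuf=on \
  -netdev netmap,ifname=vale1:2,id=d2
```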

vmaffione commented 7 years ago

Hi, ptnetmap instructions are now available at https://github.com/luigirizzo/netmap/blob/master/README.ptnetmap

vmaffione commented 7 years ago

Hi, the performance issue should be fixed on the current master (see #393). Could you please retry?

vmaffione commented 7 years ago

Closing this as it has been fixed in current master. Please reopen if you find it's not fixed yet.