luigirizzo / netmap

Automatically exported from code.google.com/p/netmap
BSD 2-Clause "Simplified" License

Netmap bridge has poor performance #594

Open hasanredzovic opened 5 years ago

hasanredzovic commented 5 years ago

Hi,

We built an application using the netmap API (netmap IP packet forwarder - NIPF) that emulates the data plane of a router. In other words, using multiple threads, NIPF receives, processes, and forwards IP packets according to the contents of an IP routing table.

Now we are trying to test the performance of NIPF, and as a reference we wanted to use the netmap example application bridge. The netmap bridge is a very simple application that only forwards packets between two interfaces using one thread, so it is ideal to compare against NIPF in order to see the performance impact of more complex packet processing functions.

However, we encountered a problem with our testing. Performance results were poor not only for NIPF but also for the netmap bridge. First we tested NIPF, and it achieved 5 Mpps per port. Then we tested the netmap bridge and measured similar results: 9 Mpps if packets are forwarded in one direction and 5 Mpps if the bridge forwards packets in both directions. For packet generation we used DPDK pktgen.

We conducted the same tests on different machines and measured the same results. The machines had different processors (Intel i7 and Intel Xeon, relatively recent generations); the rest of the hardware specification was: 16 GB RAM and a 4 x 10 Gbit/s Intel NIC using the i40e driver. The machines also ran different OSs (Ubuntu 18.10, kernel 4.18.0-15-generic, and CentOS 7, kernel 3.10.0-957.5.1.el7.x86_64). We used the most recent version of netmap.

The steps mentioned in the README and LINUX/README files were followed in order to ensure an optimal netmap configuration, for example "ethtool -K eth0 tx off rx off gso off tso off gro off lro off".
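
For completeness, a minimal sketch of how these offload settings would be applied to both test ports (the interface names are the ones from our setup; the exact offload list follows the README):

$ sudo ethtool -K enp3s0f0 tx off rx off gso off tso off gro off lro off
$ sudo ethtool -K enp3s0f1 tx off rx off gso off tso off gro off lro off
$ ethtool -k enp3s0f0 | grep offload    # verify that the offloads are reported as off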

Is this normal performance for the netmap bridge?

vmaffione commented 5 years ago

Have you disabled ethernet flow control on all the NICs?

ethtool -A ethX tx off rx off

What is the output of lsmod | grep netmap?

hasanredzovic commented 5 years ago

Thank you for the response! Yes, flow control was disabled. We used the command "ethtool -A ethX tx off rx off". Below are the flow control settings of the interfaces used in the netmap bridge test:

$ ethtool -a enp3s0f0
Pause parameters for enp3s0f0:
Autonegotiate: off
RX: off
TX: off

$ ethtool -a enp3s0f1
Pause parameters for enp3s0f1:
Autonegotiate: off
RX: off
TX: off

Output of lsmod | grep netmap:
netmap  172032  1  i40e
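
(For context, the i40e entry above indicates that the NIC is served by the netmap-patched driver rather than the emulated adapter. A rough sketch of how we load it, with module paths depending on where the netmap tree was built:)

$ sudo rmmod i40e             # unload the in-tree i40e driver
$ sudo insmod ./netmap.ko     # netmap core module built from the netmap sources
$ sudo insmod ./i40e/i40e.ko  # netmap-patched i40e driver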

vmaffione commented 5 years ago

Ok, that was just to check that you were using the netmap-modified driver (and not the emulated adapter). The bridge application per se is very simple, and it is just an example. However, it should not be the bottleneck. For example, if you bridge two pipe interfaces:

$ sudo bridge -i vale:x}1 -i vale:x{2

and then generate and sink traffic with pkt-gen:

$ sudo pkt-gen -i vale:x}2  # sink
$ sudo pkt-gen -i vale:x{1  # generate

I can measure about 30 Mpps. Note that in this case I'm working with batch sizes of 512, which is very good. So the first thing you should check is that the average batch size for bridge (receive and transmit) is large enough. If it's not, bridge can be a bottleneck because of the syscall overhead. The batch may be small because of too many NIC interrupts. You can measure the interrupt rate of your NIC (both TX and RX), and maybe increase interrupt moderation with ethtool.
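
A minimal sketch of how the interrupt rate could be observed (the interface name is an assumption; i40e typically exposes one IRQ line per TX/RX queue pair):

$ watch -n 1 'grep enp3s0f0 /proc/interrupts'   # per-queue interrupt counters, watch how fast they grow
$ ethtool -c enp3s0f0                           # current coalescing (interrupt moderation) settings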

If batch is large enough, we can look further at performance issues in the driver.

vmaffione commented 5 years ago

Have you looked at the CPU utilization of the bridge process? That can tell you whether bridge is the bottleneck or not. Regarding NIC interrupts, it may also be that there are too few, so that packets are silently dropped in the NIC (in that case you should see low CPU utilization). Regarding the average batch size, bridge does not measure it, so you would have to extend bridge to do that. Another thing you can do is check whether changing the number of NIC TX/RX queues has an effect: maybe there are too many or too few (ethtool -L)? Ditto for the number of TX/RX slots (ethtool -G), as too few slots may cause drops and too many may cause too much cache thrashing. Also, you should check whether passing --without-dmasync to ./configure has an effect for you (it's an x86-specific optimization).
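
A minimal sketch of the knobs mentioned above (interface name and values are placeholders, not recommendations):

$ ethtool -l enp3s0f0                      # show current/maximum number of queues
$ sudo ethtool -L enp3s0f0 combined 2      # set the number of TX/RX queue pairs
$ ethtool -g enp3s0f0                      # show current/maximum ring sizes
$ sudo ethtool -G enp3s0f0 rx 512 tx 512   # set the number of RX/TX slots per ring
$ ./configure --without-dmasync && make    # rebuild netmap with the x86-specific optimization mentioned above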

Note that you can use netmap pipes instead of physical interfaces (as in the example in my previous comment) to see the maximum rate that your application can process. Netmap pipes are very fast and zero-copy, so the I/O cost is really low and the application bottleneck will show up.

hasanredzovic commented 5 years ago

Thank you once again for the very informative response! Following your advice, we bridged two pipe interfaces and measured about 32 Mpps. Also, when we added the --without-dmasync flag to ./configure, we were able to achieve an even higher packet rate (37 Mpps).

We are familiar with the balance between the number of NIC Rx/Tx queues and the number of Rx/Tx slots per queue, and we know how they can influence performance. Before our initial question here, we had already tried different combinations of Rx/Tx queues and numbers of slots, without a noticeable difference in the performance of bridge. While we are on the subject, I would like to share some results from our second experiment with mSwitch modules. We have a test setup with a server that has 5 Intel NICs with 4 x 10 Gbit/s each (200 Gbit/s in total). To keep this post short, I don't want to go into too much detail about testing the mSwitch modules, but I want to point out some things that may be beneficial for netmap development:

  1. In order to achieve optimal performance, we need to use about 128 or 256 slots per queue, and 2 Rx/Tx queues per NIC interface. When we tried to configure more slots per queue or more queues, we experienced a 30% performance loss. Our hypothesis was that at a certain point of NIC interfaces x Rx/Tx queues x slots per queue, we overflow the CPU's cache capacity, resulting in longer packet processing times. However, we are not sure about the exact packet processing path through the system: is there Direct Cache Access, in which case packets are copied directly into the CPU cache? Then, when we overflow the CPU cache, packets are first sent to RAM, which is slower. We also had a hard time measuring the CPU cache misses that correlate with the mSwitch modules (see the sketch after this list).

  2. Opening 20 x 10 Gbit/s ports with mSwitch, or with any other netmap application, was impossible without modifying the netmap code. We modified netmap_mem2.c which, as we understand it, has a statically defined memory pool size that becomes depleted before the 20th NIC interface is opened. In netmap_mem2.c we just increased the numbers that define the size of the memory pool. These modifications were purely empirical: we kept increasing the numbers until we were able to open all NIC interfaces. Maybe this information can help with improving netmap_mem2.c. If needed, I can post our modifications to netmap_mem2.c.
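
Regarding the cache miss measurements in point 1, a minimal sketch of one way to measure them with perf (the target process and the event list are placeholders; attributing misses to individual mSwitch modules is of course harder):

$ sudo perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses \
      -p $(pidof bridge) -- sleep 10      # count cache/LLC activity of the bridge process for 10 seconds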

Now, back to the problem with bridge. We modified bridge in order to track the batch size per syscall. With pipe interfaces attached to bridge, the batch size was always 512. Next, we measured the batch sizes when bridge was using the NIC interfaces. We tried to control the number of NIC interrupts (ethtool -C) by disabling adaptive interrupt moderation and setting the interrupt coalescing delay to a fixed value in µs. The following figures show the batch size distribution within bridge for different interrupt rate configurations (sorry, I didn't normalize the y-axis).

Fig. 1 - Rx/Tx queue size: 2048, adaptive interrupt moderation off, rx-usecs 128, tx-usecs 128

For results shown in fig. 1, bridge achieved about 10 Mpps in one direction.

Fig. 2 - Rx/Tx queue size: 512, adaptive interrupt moderation on

For results shown in fig. 2, bridge achieved about 10 Mpps in one direction.

Fig. 3 - Rx/Tx queue size: 512, adaptive interrupt moderation off, rx-usecs 0, tx-usecs 0

For results shown in fig. 3, bridge achieved about 8.7 Mpps in one direction.

Fig. 4 - Rx/Tx queue size: 512, adaptive interrupt moderation off, rx-usecs 128, tx-usecs 128

For results shown in fig. 4, bridge achieved about 10 Mpps in one direction.

Fig. 5 - Rx/Tx queue size: 512, adaptive interrupt moderation off, rx-usecs 256, tx-usecs 256

For results shown in fig. 5, bridge achieved about 8.7 Mpps in one direction.

To summarize, with adaptive interrupt moderation off and rx-usecs and tx-usecs set to values that are too small or too large, we see some performance degradation, which is to be expected. However, adjusting these parameters didn't improve performance. The batch size distribution did change, but the main portion of each distribution was around half of the Rx/Tx queue size (the number of slots per queue). Is that normal?
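
For reference, the coalescing settings used in the figures were applied along these lines (interface name assumed):

$ sudo ethtool -C enp3s0f0 adaptive-rx off adaptive-tx off rx-usecs 128 tx-usecs 128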

Still, we don't have an answer as to why performance is poor for bridge and NIPF... I'm not sure what else we can try. Also, in the next couple of weeks, after we finish some other tests, we will dedicate a server to testing bridge and NIPF, where we plan to install everything from scratch, including the OS. If we get the same results as now, we will have to conclude that there is nothing else to try and that these results reflect the current performance capabilities of netmap applications.

vmaffione commented 5 years ago

Thank you very much for the deep analysis. This is very interesting. It is very clear from your pictures and numbers that the batch size is large enough and does not really limit your performance. Having a batch size of half the ring size is actually already optimal.

Minor point: --without-dmasync was meant to accelerate netmap on NICs (e.g. your bridge experiments). Did you leave it enabled when running these nice experiments?

Regarding CPU cache overflow, you are right. If your working set does not fit in the LLC (or the L2 cache) you will see performance degradation. Using shorter queues definitely makes sense with many interfaces. Regarding the limitations of netmap_mem2.c: yes, that part needs to be rewritten, but I'm not sure that you need to change the code. Have you tried increasing the allocator parameters in /sys/module/netmap/parameters? You should increase if_num, buf_num and ring_num as needed, that is, the number of registered netmap interfaces, the number of buffers and the number of rings (in the global allocator, which serves the NIC ports). You can also increase ring_size or if_size if needed.
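
A minimal sketch of what that looks like (the values are placeholders; the parameters can also be passed at module load time, and are best set before any netmap port is opened so the global allocator picks them up):

$ cat /sys/module/netmap/parameters/buf_num                       # inspect the current number of buffers
$ echo 400000 | sudo tee /sys/module/netmap/parameters/buf_num    # more buffers
$ echo 200 | sudo tee /sys/module/netmap/parameters/ring_num      # more rings
$ echo 100 | sudo tee /sys/module/netmap/parameters/if_num        # more netmap interfaces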

I don't think 10 Mpps per core is the limit of netmap apps in general. As you have seen with pipes, you can reach more. Your analysis of the batch sizes shows that syscalls are not your bottleneck, but there are other things that may be. First, the CPU running bridge: is it at 100% or less? If it's at 100%, it may be worth using perf to check what is eating up most of the time. Second: the i40e netmap driver. There may be a misconfiguration or a regression there. Can you see how many Mpps you get with pkt-gen? E.g.:

pkt-gen -i ethX -f tx  # transmission, use DPDK pktgen on the receiver side

and

pkt-gen -i ethX -f rx  # receive, use DPDK pktgen on the sender side

In this way we can see how much we can push the NIC with netmap. You should get at least 20 Mpps with i40e.
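
For instance, a slightly more explicit invocation (interface name, frame length and thread count are placeholders):

$ sudo pkt-gen -i enp3s0f0 -f tx -l 60 -p 1   # transmit minimum-size frames with one thread
$ sudo pkt-gen -i enp3s0f0 -f rx              # receive test, with DPDK pktgen sending from the other end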

Thanks