Strange performance bottleneck with 2 interfaces

desbma commented 7 years ago

I have a small program similar to pkt-gen in rx mode, but a lot simpler, which captures traffic on two 10Gb interfaces (Intel XL710) and count received frames.

I am using latest Netmap from master branch, Linux kernel 4.4 and i40e driver v1.5.25.

I am seeing a performance bottleneck that I can not explain:

when I send 1s of traffic at 10Gb/s (only 64 B frames) to a single interface, everything is captured (0 loss)
when I send 1s of traffic at 10Gb/s (only 64 B frames) to both interfaces, only about ~65% of frames are captured on both interfaces

Since the interfaces are independent, and so are the associated Netmap data structures, I expect the performance to be the same with several interfaces. I do not read the frame payload, so it cannot be a memory bandwidth limitation.

My threads have no shared data, and the stat counters for each thread are on a different CPU cache line. To be sure I have tried using 2 processes which each capture on a single interface and I see the same performance drop. Each thread is pinned to a single free core on the same NUMA node as the PCI devices.

I have made tests with :

1 RSS queue and 1 reader thread per interface
2 RSS queues and 2 reader threads per interface
3 RSS queues and 3 reader threads per interface

And I only see a slight performance increase, which stops with more than 3 threads.

What I am missing and why can't I have 0% loss on 2 x 10Gb/s interfaces ?

Thanks

vmaffione commented 7 years ago

I think you may have saturated your pci bus, which may not be able to run 30 millions of packets per second, with each packet requiring a pci transaction for the payload and a transaction for a batch of descriptor, plus sporadic accesses to device registers.

vmaffione commented 7 years ago

You may try to play with increasing packet size, And or move your nics on pci slots with more lines, if available.

desbma commented 7 years ago

It can not be a PCI bus limitation, because with other zero copy frameworks (PF_Ring and DPDK) I have managed to get 0 or <1% of loss with similar tests.

vmaffione commented 7 years ago

Mm ok but what is the cpu utilization that you see in both cases? It should be very low, which means it cannot be a cpu limitation. You should also try to measure the interrupt rate, e.g. /proc/interrupt.

desbma commented 7 years ago

I use active polling (loop on the NIOCRXSYNC ioctl) because in my tests this improves performance by about 500 Mb/s for 64 B frames, so CPU utilization is always 100% on the affected cores.

I'll check tomorrow the interrupt rates (I'm not in front of the test machine right now), but I remember they were very low compared to the non-netmap case.

vmaffione commented 7 years ago

Yes, you should also try to measure the average per syscall batch, that's very important to understand what is happening. Keep in mind that in general busy polling has negative effect on the avg batch size.

desbma commented 7 years ago

All right here are the results of my tests (all done with 64 B frames at maximum link speed) :

Interrupt rate for NIC RX interrupts, with single RSS queue and thread : ~370 captured frames/interrupt
Soft interrupt rate (NET_RX), with single RSS queue and thread : ~230 captured frames/interrupt
Removing active polling does not improve performance significantly
When using "hybrid" active/passive polling (poll call, then NIOCRXSYNC ioctl loop as long as packets are available, then poll again), and 3 RSS queues and threads per interface, cores are loaded by about 30% only, but I still get the same loss
When using a single interface, a single RSS queue and thread, and 30s of traffic, I manage to capture every single frame with active polling, and almost every frame (<1% loss) with hybrid polling

So clearly something is going on when using two interfaces. Is there some internal locking that could generate contention?

vmaffione commented 7 years ago

Interrupt rate should be ok. Anyway, you can try to play with "ethtool -C rx-usecs XXX" to limit the average rate as you wish, and see if this has same effect.

In terms of locking, each ring is protected by a separate lock in kernespace, and it is usually a logic error to let two threads access the same ring concurrently (the lock is there only to protect the hardware). This is to say that there shouldn't be any contention.

I still think it's something wrong with the PCI bus, or maybe some bad interaction with NUMA. You could try to increase packet size to see if and at which packet size the issue goes away.

desbma commented 7 years ago

Interrupt rate should be ok. Anyway, you can try to play with "ethtool -C rx-usecs XXX" to limit the average rate as you wish, and see if this has same effect.

I already tried to play with the various ethtool interrupt rate parameters, and they have no significant influence on performance with Netmap. My cores are mostly idle here if I use passive polling and enough cores/RSS queues, while in non Netmap mode, with a lot of traffic I see a huge "softirq" CPU usage on the cores that are mapped to the RSS queues interrupt lines. So I do not believe interrupts are the bottleneck here.

In terms of locking, each ring is protected by a separate lock in kernespace, and it is usually a logic error to let two threads access the same ring concurrently (the lock is there only to protect the hardware). This is to say that there shouldn't be any contention.

Is there any global Netmap lock apart from the ring lock? Is the nm_rxsync function called from the same task for both interfaces?

I still think it's something wrong with the PCI bus, or maybe some bad interaction with NUMA.

I'm only using cores (pinned with sched_setaffinity) on the same NUMA node as the PCI bus hosting the NIC.

You could try to increase packet size to see if and at which packet size the issue goes away.

I'm trying to test the worst case performance wise. As you know for a zero copy network framework this is the maximum frame rate (minimum frame size at maximum link speed). Of course if I increase the frame size, the frame rate at the same 2 x 10 Gb/s speed link is reduced and therefore there is no loss. I test frame sizes suggested by RFC 2544 and with size 124 (Ethernet header included, but FCS excluded), I have no loss at all. But that does not tell me anything about the weird 1 vs 2 interfaces difference.

I'm doing my tests on a high end Xeon server with 112 cores and 36 MB of CPU cache, and that performance should be easy to achieve as long as no packet copy occurs.

At first I thought the bottleneck was caused by the number of 4K memory pages used by the buffer pools, because in the same conditions I managed to have no loss with PF_Ring ZC and DPDK, which both use x86 huge pages (2MB for PF_Ring and 1GB for DPDK) to reduce the number of TLB cache misses. However, by running Netmap and my other test programs with perf -e dTLB-load-misses I do not see excessively high counters that could explain that difference.

EDIT : If you have any idea of test, profiling, perf counter etc. to check to investigate further, that would be great. Thanks!

giuseppelettieri commented 7 years ago

What do you see if you reduce the number of slots per ring (ethtool -G)? Note that you have to do this before starting netmap applications.

EDIT: other thing to try: assuming that you control the size of the incoming packets, reduce the netmap buffer size as much as possible, e.g.

# echo 64 > /sys/module/netmap/parameters/buf_size

(again, do this before starting the netmap application)

vmaffione commented 7 years ago

@desbma : I see your point, I just wanted to double check.

Of course testing the worst case is the right thing to do, but for the sake of debugging it could help trying to increase the packet lenght by 2 or 4 bytes (to see if there is some alignment problem), or increase more to see if the problem suddenly disappears after a certain size.

Anyway, netmap has an internal global lock, but just for configuration purposes, it is never used on the datapath. On the datapath, each ring uses a different lock, so rxsyncs/txsyncs on different rings never need to serialize, at least for hardware rings. There is a separate lock for the software ring, but from your description I doubt you are using it.

desbma commented 7 years ago

Thanks to both of you for your answers.

@giuseppelettieri

What do you see if you reduce the number of slots per ring (ethtool -G)? Note that you have to do this before starting netmap applications.

I usually have it set to the maximum (4096), because I remember seeing that it improves performance in some cases. I just tried:

2048 => no significant change
1024 => loss is roughly halved
512 => loss is now <0.01%
256 => loss is still<0.01%

Interestingly though, if I run a dichotomy on the bandwidth to find the maximum rate with no loss, I get exactly the same 2 x ~5.75 Gb/s value.

So maximum rate with no loss is the same with a 512 ring size, despite the loss at maximum link rate for small frames being considerably lower.

EDIT: other thing to try: assuming that you control the size of the incoming packets, reduce the netmap buffer size as much as possible, e.g.

echo 64 > /sys/module/netmap/parameters/buf_size

(again, do this before starting the netmap application)

I did as you suggested, and I now capture every single frame, even with a ring size of 4096.

I should also mention that I apply a small patch to bump memory limits because otherwise I can hit them in my tests (worst case memory wise is 4096 slots rings x 112 RSS queues x 2 interfaces :

--- netmap_mem2.c.orig  2016-12-21 10:24:54.000000000 +0100
    +++ netmap_mem2.c   2017-06-29 15:01:04.455354038 +0200
    @@ -422,7 +422,7 @@
                .objminsize = 64,
                .objmaxsize = 65536,
                .nummin     = 4,
    -           .nummax     = 1000000, /* one million! */
    +           .nummax     = 16000000,
            },
        },

    @@ -432,7 +432,7 @@
                .num  = 100,
            },
            [NETMAP_RING_POOL] = {
    -           .size = 9*PAGE_SIZE,
    +           .size = 20*PAGE_SIZE,
                .num  = 200,
            },
            [NETMAP_BUF_POOL] = {

@vmaffione

Of course testing the worst case is the right thing to do, but for the sake of debugging it could help trying to increase the packet lenght by 2 or 4 bytes (to see if there is some alignment problem), or increase more to see if the problem suddenly disappears after a certain size.

Sorry I misunderstood the purpose of the test you suggested. :) So I have tried the following frame sizes (Ethernet header included, FCS excluded), with 30s of traffic :

60 : about 40% loss
64 : no single frame lost
68 : about 33% loss
72 : about 30% loss
76 : about 28% loss
80 : about 25% loss
84 : about 21% loss
88 : about 19% loss
92 : about 17% loss
96: about 13% loss
100: about 11% loss
104 : about 7% loss
108 : about 6% loss
112 : about 3% loss
116 : no single frame lost

desbma commented 7 years ago

So to summarize:

reducing ring size reduces loss percentage, but does not fix the problem that maximum rate is lower with 2 interfaces than with 1
reducing slot buffer size fixes it completely (of course this is not a long term solution, but still interesting result)
the issue does not exist for frames of size 64

I admit I have a hard time interpreting theses results, especially the last one. Since the frame payloads in the slots are only accessed (written) by DMA, and the slot buffers are of fixed size 2048, why is the performance different with a particular frame size?

giuseppelettieri commented 7 years ago

Interesting results, thanks desbma.

The last point is probably related to the problem Luigi already explained in the original netmap paper: the DMA engine in the card is doubling the number of PCI transactions, trying to preserve the trailing cacheline bytes not overwritten by the incoming packets.

My idea for the first two points, instead, is that DDIO is trashing the cache. The netmap buffers are all aligned to 2K boundaries, and this may cause a lot of cache conflicts and bad cache utilisation. Reducing the ring size reduces the number of buffers competing for the same cache line. Making the buffers only 64 bytes long will let each one of them go into a different cache line.

I think that the proper way to fix this latter issue would be to add some cache-coloring to the netmap buffers. I have some ideas on how to retrofit this in some relatively unobtrusive way, but I have to experiment with that.

desbma commented 7 years ago

The last point is probably related to the problem Luigi already explained in the original netmap paper: the DMA engine in the card is doubling the number of PCI transactions, trying to preserve the trailing cacheline bytes not overwritten by the incoming packets.

Indeed, this seems to describe exactly what is going on. However, if the number of PCI transactions is the bottleneck, how come other factors like slot buffer size can still improve performance? Also how do PF_Ring and DPDK get around this limitation if it is purely a hardware bus limitation?

My idea for the first two points, instead, is that DDIO is trashing the cache. The netmap buffers are all aligned to 2K boundaries, and this may cause a lot of cache conflicts and bad cache utilisation. Reducing the ring size reduces the number of buffers competing for the same cache line. Making the buffers only 64 bytes long will let each one of them go into a different cache line.

I don't understand that either: 2048 is multiple of the cache line size (64), so how does that cause cache conflicts? Each incoming frame will get into it's own cache area, without affecting other frames, no?

I think that the proper way to fix this latter issue would be to add some cache-coloring to the netmap buffers. I have some ideas on how to retrofit this in some relatively unobtrusive way, but I have to experiment with that.

Well, if you need a tester to run the code of your testing branch, you can count on me. :)

giuseppelettieri commented 7 years ago

However, if the number of PCI transactions is the bottleneck, how come other factors like slot buffer size can still improve performance? Also how do PF_Ring and DPDK get around this limitation if it is purely a hardware bus limitation?

have you tried 64B slot buffer size with 60 bytes packets? That would eliminate the cache issue and leave only the PCI transactions issue.

don't understand that either: 2048 is multiple of the cache line size (64), so how does that cause cache conflicts? Each incoming frame will get into it's own cache area, without affecting other frames, no?

Well, two addresses conflict when their difference is a multiple of the cache size. Assume, for example, that all the buffers are sequential in memory. With 2048B buffers and a 2MiB cache you will get a conflict between buffer 0 and buffer 1024, buffer 1 and buffer 1025, and so on.

Well, if you need a tester to run the code of your testing branch, you can count on me. :)

There is one thing that you can try (and that I always forget). What do you see when you set the buffer size to 2112 (i.e., 2048+64)? (Note that this will probably be upgraded to 2176=2024+128 since NM_CACHE_ALIGN is set to 128 in sys/net/netmap.h; it should be beneficial anyway).

desbma commented 7 years ago

have you tried 64B slot buffer size with 60 bytes packets? That would eliminate the cache issue and leave only the PCI transactions issue.

I just did the test: I get 0 lost frames. So the PCI bus does not seem to be the cause.

Well, two addresses conflict when their difference is a multiple of the cache size.

You wrote "Reducing the ring size reduces the number of buffers competing for the same cache line.". If buffers are of size 2048, which is multiple of cache line size 64, and assuming the pool start adress is cache line aligned, then two different buffers can never compete for the same cache line, or am I missing something?

Assume, for example, that all the buffers are sequential in memory. With 2048B buffers and a 2MiB cache you will get a conflict between buffer 0 and buffer 1024, buffer 1 and buffer 1025, and so on.

What do you mean by conflict? If you access sequentially buffer at offset 0 and 1024, then 1 and 1025 etc. the space in between won't necessarly be loaded in the cache, only the cache lines at the offset accessed are guaranteed to be in it. Also I won't be surprised if the CPU detects the access pattern and proactively preloads cache lines at the offsets n+1.

There is one thing that you can try (and that I always forget). What do you see when you set the buffer size to 2112 (i.e., 2048+64)? (Note that this will probably be upgraded to 2176=2024+128 since NM_CACHE_ALIGN is set to 128 in sys/net/netmap.h; it should be beneficial anyway).

I get no loss.

Now I have a usable workaround, the i40e driver always gives chunks of maximum size 2048 even for jumbo frames, so there is no harm in using 2112 buffers apart from a little memory waste.

Still I don't understand this result since both 2048 and 2112 are multiple of cache line sizes. What is your interpretation?

Thanks again for your help.

giuseppelettieri commented 7 years ago

I just did the test: I get 0 lost frames. So the PCI bus does not seem to be the cause.

I think it still the problem that Luigi found, but maybe related to the cache-memory path rather than the PCI bus. In order to do a partial write into a cacheline, the DDIO mechanism will be forced to first fetch the cacheline from memory. This is of course expensive.

Still, I don't know how/why PF_RING and DPDK don't suffer from this. Maybe there is a way to tell the device that we don't want to preserve the value of those trailing bytes.

If buffers are of size 2048, which is multiple of cache line size 64, and assuming the pool start adress is cache line aligned, then two different buffers can never compete for the same cache line, or am I missing something?

OK, I think I see what you are missing. I am not referring to the fact that two buffers may share the same cache line. Since each buffer size is a multiple of a cache line, and base addresses are cache-aligned, this sharing is not possible.

No. The problem are conflict misses. The hash function that the cache uses to map memory addresses to cache lines is very simple: it uses the lower part of address >> 6 (assuming 64B cache lines). An n-way cache cannot contain more than n cachelines whose addresses hash to the same value, even if the cache is otherwise empty.

Also I won't be surprised if the CPU detects the access pattern and proactively preloads cache lines at the offsets n+1.

Of course it will, and in many cases this automatic prefetching does indeed help. But here it is not the CPU that is performing the accesses. Instead, it is the I/O device that is issuing PCI write transactions that target the cache (DDIO). You cannot prefetch I/O.

desbma commented 7 years ago

The hash function that the cache uses to map memory addresses to cache lines is very simple: it uses the lower part of address >> 6 (assuming 64B cache lines). An n-way cache cannot contain more than n cachelines whose addresses hash to the same value, even if the cache is otherwise empty.

So using 2112 as the slot buffer size instead of 2048 creates more variation in the lower bits of the address, which reduces hash collisions and allows better utilization of the CPU cache, am I understanding correctly?

giuseppelettieri commented 7 years ago

Yes, exactly.

luigirizzo / netmap

Strange performance bottleneck with 2 interfaces #336

echo 64 > /sys/module/netmap/parameters/buf_size