Packet copies: Expensive or cheap? #648

lukego commented 8 years ago

Just wanted to capture some thoughts about whether packet copies are fundamentally expensive or cheap.

Packet copies are a recurring theme in networking in general and in Snabb Switch in particular. In the olden days packet copies were prohibitively expensive and had to be avoided. In the future it looks like Intel's SIMD work should make packet copies trivially cheap. Currently we are somewhere in between - but where exactly?

Thought experiment: Consider the extreme case of processing 64-byte packets.

AVX512 (2017)

Here is the assembler code to copy a 64-byte packet on a CPU supporting AVX512 (now expected on Xeon CPUs in 2017) as an "unrolled loop":

vmovdqu64 zmm0, [src]
vmovdqu64 [dst], zmm0

That is two instructions: load from memory into the 64-byte zmm0 register and then store the same value back to memory. These instructions are cheap: modern Intel CPUs can typically execute 2-3 such instructions per cycle. Note that the u in the instruction name says that these are unaligned memory accesses: Intel's latest engineering work has been to make unaligned SIMD operations fast so that programmers don't have to worry about it.

This looks like a really small and cheap bit of code. I find it hard to imagine trying to optimize this away even from an inner loop.

AVX2 (2014)

On an AVX2 CPU (Haswell) the situation is similar except that we need twice as many instructions because we only have 32-byte ymm registers:

vmovdqu ymm0, [src]
vmovdqu [dst], ymm0
vmovdqu ymm0, [src+32]
vmovdqu [dst+32], ymm0

This is four instructions now but these are also cheap (Haswell can execute more than one such instruction per cycle - see Agner Fog's instruction tables).

This also seems like a really tiny amount of work for a CPU to do. Hard to imagine this being a performance hotspot during packet processing.
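
For reference, the same copies can be written from C with compiler intrinsics. This is only a sketch (not code from the Snabb tree, and the function names are made up): it assumes GCC or Clang with <immintrin.h> and -mavx2 (plus AVX-512 support for the second variant), and it should compile down to essentially the instruction sequences shown above.

/* Hedged sketch: 64-byte copy via intrinsics. Hypothetical helper names. */
#include <immintrin.h>
#include <stdint.h>

static inline void copy64_avx2(void *dst, const void *src)
{
    /* Two unaligned 32-byte loads and two stores: four vmovdqu instructions. */
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 32));
    _mm256_storeu_si256((__m256i *)dst, lo);
    _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 32), hi);
}

#ifdef __AVX512F__
static inline void copy64_avx512(void *dst, const void *src)
{
    /* One unaligned 64-byte load and one store: two vmovdqu64 instructions. */
    _mm512_storeu_si512(dst, _mm512_loadu_si512(src));
}
#endif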

Sandy Bridge (2009)

On AVX CPUs (Sandy Bridge) the situation is a little more complex. Certain instructions only support 16-byte registers and unaligned memory accesses are more expensive. However: these CPUs are dying out and it seems reasonable to expect that a new networking product developed today will rarely be deployed on a Sandy Bridge CPU.

Summary

Here is how it seems to me:

Why is this important?

In the past we have done a lot of really complex engineering work to eliminate copies. Nikolay and I have lots of impressive scars from our fights to make this work between Virtio-net and Intel 82599. Then we embraced the straightline design and SIMD instructions and we were able to delete all the complex copy-avoidance code while preserving overall system performance.

We seem to be in a similar situation again now. We are optimizing our programs based on the notion that packet copies are expensive -- not super expensive but too expensive to ignore. For example in the Virtio-net driver (#645) we are debating whether and how to eliminate packet.prepend() calls and in the lib.protocol overhaul (#637) we are making efforts to combine copies together to improve performance.

So the question on my mind is: Can we improve our copy routine so that it is "so fast it doesn't even matter" and free ourselves from these concerns? If not then why not? I have no mental model for why moving a small packet in memory must be an expensive operation on the platforms we are targeting.

This idea becomes a little more actionable now that DynASM has added support for SIMD instructions (luajit/luajit#95) and we are increasing our assembler programming sophistication.

kbara commented 8 years ago

It definitely still is an expensive operation; for the lwaftr, we had to ditch using lib.protocol entirely on our hot path, though we kept it for the less common operations (it does make the code nicer). This wasn't entirely due to packet copying overhead, but that was a double-digit percent of the massive speed difference it made. I'm currently writing lib.protocol style code for ARP, in the hopes that @alexandergall 's code to avoid multiple moves lands - and because ARP is a rare enough thing that it's not fatal if it doesn't. My ICMP mini-library, from before #637 , shuns the lib.protocol style entirely (so I need to figure out how to rewrite it to try to upstream it). The broader question of ultra-fast copy routines is rather interesting.

lukego commented 8 years ago

It definitely still is an expensive operation

For what value of "it" though?

I do realize that calling ffi.copy() is expensive. I don't think it follows that copying is an expensive operation though. Sounds more like there is a bug in our memcpy that we are spending a lot of effort to work around and should fix instead. (This would benefit the reworked lib.protocol too since that is still doing a move at the end.)

One reference data point is #603 where I ported some app code to assembler and saw an average of 4.6 cycles per processing step (one of which includes a 64-byte copy to duplicate a packet). This code is a bit artificial but I don't immediately see why our normal packet copies couldn't be in that performance ballpark.

lukego commented 8 years ago

(Me talking too much and not giving other people much room to chime in... but hard to resist...)

I wonder what hypotheses we have for why copies take surprisingly much CPU time today and how we could test them?

Here are a few hypotheses:

  1. The actual data movement instructions take a long time to execute. (Could try to reproduce with hand-written assembler routine?)
  2. The GLIBC memcpy code itself is expensive e.g. incurs branch misses when checking edge cases. (Could benchmark it with pure C code and/or profile it with perf?)
  3. The ffi.copy causes a compiler load/store barrier that forces the JIT to generate a bunch of extra instructions to move data from registers into memory and back again. (Could read LuaJIT source or inspect generated code?)
  4. The ffi.copy makes a C function call with an expensive calling convention that forces the JIT to generate a bunch of extra instructions to save registers on the stack and restore them. Especially the SIMD registers, which LuaJIT does use and which are all clobbered across a call (caller-save). (Could inspect the generated code to see?)
  5. Some kind of cache effect. Could be that the packet data is not in L2 cache and we have to take an L3 access before the copy instructions can be retired? (Could profile with pmu module and look for L2 cache misses?)

That's everything that I can think of right now. I suppose it has to be one or more of those reasons -- or what else could it be?
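
To make hypotheses 1 and 2 testable outside of LuaJIT, something like the following C microbenchmark could compare an out-of-line glibc memcpy call against an inlined constant-size copy on cache-hot data. This is a rough sketch under stated assumptions (x86-64, GCC or Clang, TSC-based timing); nothing here is existing Snabb code.

/* Hypothetical benchmark sketch: separate the cost of the data movement
 * itself (inlined constant-size copy) from the cost of calling glibc memcpy
 * (out-of-line call). Build e.g.: gcc -O2 copybench.c -o copybench */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>                     /* __rdtsc() */

#define ITERS 10000000ULL

static uint8_t src[64], dst[64];
static volatile size_t runtime_len = 64;   /* non-constant size defeats the builtin */

int main(void)
{
    uint64_t t0, t1;

    t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        memcpy(dst, src, runtime_len);     /* real call into glibc memcpy */
        asm volatile("" ::: "memory");     /* keep the copy from being elided */
    }
    t1 = __rdtsc();
    printf("glibc memcpy(64):   %.2f cycles/copy\n", (double)(t1 - t0) / ITERS);

    t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        memcpy(dst, src, 64);              /* constant size: expanded inline */
        asm volatile("" ::: "memory");
    }
    t1 = __rdtsc();
    printf("inlined memcpy(64): %.2f cycles/copy\n", (double)(t1 - t0) / ITERS);
    return 0;
}

Hypotheses 3-5 would have to be checked from inside LuaJIT (generated traces, pmu counters), which a standalone C benchmark obviously cannot do.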

kbara commented 8 years ago

For the value of "it" being "I'm using the lib.protocol stack, and it's shifting memory around 3 times": using it to make headers and manually copying them is faster (5 MPPS on one arbitrary benchmark); writing the bytes without involving lib.protocol is faster yet (8 MPPS on the same arbitrary benchmark). Unfortunately, I don't have the speed before those changes to hand, but I think it was around 1 or 2 MPPS on that same benchmark on the same hardware. While this is vague enough not to be particularly useful, having a factor of 1.5-2 difference by avoiding lib.protocol is major, unfortunately. ( https://github.com/Igalia/snabbswitch/pull/22 has a few raw performance dumps on repeated runs on a number of different packets/codepaths, but with enough different code changes going on that it's not very informative; it was quick-and-dirty heuristics that made sense at the time).

We were mainly testing on 550-byte packets, though we had similar results with much larger/smaller ones. We didn't explore different memory copying operations, due to time constraints and getting enough performance without doing that, but you're right that that's rich ground. The libav community still considered copies something to be avoided at all costs when I was collaborating with them in summer 2014, for what it's worth.

I'd wonder about the performance impact of most of our packets not being a multiple of 64 bytes; there's a fairly long history of "type X of copy can really go fast, type X' which is similar is quite a bit slower". Things like overlapping data can also be an issue, like you've mentioned - if I want to shift a packet by 14 or 18 bytes as I add/drop a header, it's a bit messy.

I'm really looking forward to seeing the work that gets done on faster packet copies. I like being able to streamline code and have cheap copies; my programming tastes run to the pure functional. I've just seen enough writeups over the last decade of minor variations killing copying performance from the 1980s onward, to not count on it until I see it.

lukego commented 8 years ago

@kbara Yeah, I understand your perspective. I acknowledge there could be a good reason to avoid copies, I just don't think that we have identified it yet and so I still hope it does not exist.

I also see potential for us to solve this problem in ways that are not available to others. For example we could round up the size of our copies to a 64-byte boundary because we do have ample space available in the buffer. Having a small collection of special-case copy routines like "Copy left in 64 byte blocks" could potentially be a neater solution than avoiding copies altogether.

Going down this rabbit hole could potentially pay other dividends too. For example if the performance cost turned out to be due to this kind of alignment issue then it could also be affecting our SIMD checksum routines. Those could also be "rounded up" to 64-byte blocks just by using one vmovdqu instruction to zero the unused bytes in the last word of data. (Tony Rogvall suggested that idea to me once, he is the guy who wrote the original SSE checksum routine that we are using.)
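
To make the rounding-up idea concrete, a special-case copy could look something like this in C (a minimal sketch with made-up names, assuming AVX2 intrinsics, buffers with slack up to the next 64-byte boundary, and either non-overlapping buffers or dst below src as in shiftleft):

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical sketch: copy len bytes, rounded up to whole 64-byte blocks. */
static inline void copy_blocks_64(void *dst, const void *src, int len)
{
    int blocks = (len + 63) / 64;          /* round up: no tail to special-case */
    const uint8_t *s = src;
    uint8_t *d = dst;
    for (int i = 0; i < blocks; i++) {
        __m256i lo = _mm256_loadu_si256((const __m256i *)(s + i * 64));
        __m256i hi = _mm256_loadu_si256((const __m256i *)(s + i * 64 + 32));
        _mm256_storeu_si256((__m256i *)(d + i * 64), lo);
        _mm256_storeu_si256((__m256i *)(d + i * 64 + 32), hi);
    }
}

The point is that the length check and tail handling disappear entirely; whether that actually wins in practice is something that would have to be measured.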

lukego commented 8 years ago

type X of copy can really go fast, type X' which is similar is quite a bit slower

@kbara That is an interesting observation and it reminds me of when I benchmarked the rte_memcpy routine from DPDK.

I saw a major speedup in our standard NFV benchmark when I tested with C.rte_memcpy() instead of ffi.copy() for moving data on virtio on a workload of 256-byte packets. However, the speedup seemed to disappear when I tested a workload that had average size of 256 bytes instead of each packet being exactly that size.

This seemed undesirable to me: I don't want to have better performance in benchmarks than in real life. However, perhaps with a trick like rounding up the copy size to a 16-byte boundary we could have that performance on all workloads? (I admit I didn't dig too deeply into this at the time but noted it as something to think about later.)

kbara commented 8 years ago

I hope it doesn't exist too. Your thoughts on padding to multiples of X bytes might well be quite useful, though I'd wonder empirically about things like overlapping copies (ie, every time we want to shift something within a packet). Like you saw, average vs exact sizes can make a massive difference; if we can sidestep even most of that (the unaligned instructions you mentioned + packet buffers being a multiple of the relevant size sound promising), it'll be a good thing.

lukego commented 8 years ago

One more thought:

ffi.copy() is an intrinsic in LuaJIT that can be compiled specially (eg for structs or constant-size objects) and only compiles into a call to memcpy if no such case matches. Potentially we could add more cases there, even with an extra 'mode' argument, if we wanted control over things like barriers and calling conventions and roundings-up.

lukego commented 8 years ago

Two more thoughts that occurred, one theoretical and one more practical:

Theoretical: Copying 550-byte packets should in the best possible case take 18 cycles with AVX2 (32 bytes / cycle) or 9 cycles on a future CPU with AVX512 and expanded cache bandwidth. If one did a triple copy to incrementally add headers then that would take 54 cycles on AVX2 or 27 cycles on AVX512. Hypothetically, if you wanted to process 10 Mpps with one 2.0 GHz core then your total budget would be 200 cycles per packet. The triple copy would then burn 27% of the total processing budget with AVX2 and 13.5% on AVX512. That would definitely be a hotspot worth optimizing: especially on AVX2, which is the best we will have for a couple of years (and that's not even thinking about older CPUs).
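
The arithmetic above, spelled out as a tiny program (same assumptions as in the text: 32 bytes/cycle on AVX2, 64 bytes/cycle on AVX512, 2.0 GHz core, 10 Mpps target, three copies of a 550-byte packet):

#include <stdio.h>

int main(void)
{
    int budget = (int)(2.0e9 / 10.0e6);       /* 200 cycles per packet */
    int avx2   = 3 * ((550 + 31) / 32);       /* 3 x 18 = 54 cycles    */
    int avx512 = 3 * ((550 + 63) / 64);       /* 3 x  9 = 27 cycles    */
    printf("AVX2:   %d cycles = %.1f%% of budget\n", avx2,   100.0 * avx2   / budget);
    printf("AVX512: %d cycles = %.1f%% of budget\n", avx512, 100.0 * avx512 / budget);
    return 0;
}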

This seems like a reasonable case against the idea of treating packet copies as "too cheap to care about", at least for a certain class of interesting applications and especially in the short term.

This also seems like a strong motivation for the lib.protocol overhaul (#637) to make fewer copies. Sorry, it was my bright idea of "let's pretend copies are free and see what happens" and it seems the effect is to limit uptake of lib.protocol which is unfortunate and needs to be fixed.

I said that I have a practical idea too: Let me write that on the PR for #637 instead though.

wingo commented 8 years ago

Ignorant questions: when you copy a packet off of a device using DMA, what is the cache effect? Is it as expensive as a full cache miss? Does the memory at the source DMA address get copied into the cache, and if so which caches? Similarly for writing, does writing a packet to a device via DMA cause anything to be written to cache or does it skip that step somehow and write directly to memory?

lukego commented 8 years ago

Good question about cache behaviour!

I don't know the answers with absolute certainty. I can explain my mental model. I would love to do some experiments with the PMU to verify/falsify/correct this.

Let us refer to my favourite diagram in the whole world: I/O architecture of a dual-processor Xeon server:

[image: diagram of the I/O architecture of a dual-processor Xeon server]

Packets are being DMA'd over the PCIe links (red arrows at the bottom). Each PCI link is connected between one NIC and one CPU (NUMA node). The hardware feature that performs the DMA is Intel DDIO. DDIO transfers the data directly between the NIC and the L3 cache of the specific processor that the NIC is attached to.

This means that a process running on the same CPU as the NIC will access the packet data via the local L3 cache while a process running on the other CPU will access packet data remotely via the QPI link. (That long trip across the QPI link is where the "bad NUMA penalty" comes from AFAIK.)

(Reading in the Intel DDIO FAQ that I linked above) it seems that if the working set can be kept within L3 cache then the packet data need never be written out to DRAM. I also read there that Intel are planning to improve DDIO so that you don't have to worry about NUMA affinity. That would sure be nice: I bet it happens right after everybody is done expensively making all of their software NUMA-aware :).

wingo commented 8 years ago

Thanks for the DDIO link!

Interesting that keeping the packets in L3 is a function of how much packet data there is. With average-size 500-byte packets you then have 2000 packets or so per megabyte, and probably you have 2 MB of L3 per core... so although increasing the number of packets per breath might improve throughput up to a point, it's interesting that past some point it would diminish throughput by causing more memory traffic for in-flight buffers. Dunno if this makes sense though :)

lukego commented 8 years ago

@andywingo We have seen substantial variations in throughput on snabbnfv by tweaking how many packets the intel10g app introduces in each breath. (Diego has looked into this too.) 100 packets seems pretty optimal and 200 is measurably slower. My theory is that we are looking for the sweet spot between too many context switches (too few packets per breath) vs the workload spilling out of L2 cache during processing (too many packets per breath). Could be an L3 cache issue instead however.

The ultimate solution to this kind of issue is supposed to be #616 which will simply report the number of L1/L2/L3 cache misses per packet for each app so that we don't have to guess. I have a few too many things competing for my attention at the moment to really finish that.

fmadio commented 8 years ago

Don't forget that all the complex branchy logic / smarts for deciding whether to copy the packet, part of the packet, or whatever can confuse the hw data prefetcher. For my own usage the effectiveness of the prefetcher has been the primary optimization/profiling goal, not the theoretical minimum instruction throughput.

gonzopancho commented 8 years ago

it seems that if the working set can be kept within L3 cache then the packet data need never be written out to DRAM. I also read there that Intel are planning to improve DDIO so that you don't have to worry about NUMA affinity.

Xeon E5-2xxx processors (Sandy Bridge/v1, Ivy Bridge/v2, and Haswell/v3) have PCIe interfaces on each processor (normally called a 'socket' these days, to denote a collection of cores and associated circuitry on a discrete substrate of silicon and ceramic). Whether these are all exposed on the motherboard depends on the vendor. In any case, if the NIC is on socket N, and the processing happens on any socket other than N, then the DMA (even DDIO) is going to have to happen across some inter-socket link.

QPI gets replaced in the "Skylake" CPUs with UltraPath Interconnect (UPI). UPI will operate at 9.6GT/s or 10.4GT/s data-rates and will be considerably more efficient than today’s QPI as it will support multiple requests per message.

So it will be faster, but won't eliminate the problems of QPI.

I think NUMA affinity will be with us for a long time to come.

lukego commented 8 years ago

I think NUMA affinity will be with us for a long time to come.

Yeah. NUMA will also surely affect a lot of different applications in different ways.

On Snabb Switch we are careful about NUMA because in practice we see ~1/3 loss of performance when processing packets from a "remote" NIC. I currently attribute this to additional latency when loading packet data from remote L3 cache vs local L3 cache. I am not really sure that this assessment is correct or complete though.

If that is the whole story though then Snabb Switch may not have to worry about NUMA affinity in the future. Intel's DDIO FAQ (linked above) says in Q6 that they are going to support DDIO DMA into the remote L3 cache in future processors. Then it should not matter which socket the NIC is wired to if the only performance issue is/was L3 cache latency.

Life would become simpler and simpler if the CPUs provided more robust performance and freed us from details like NUMA affinity, SIMD alignment, and so on.

I am not sure whether we will ever really need to worry about bandwidth on any of these links. Currently it seems to me like HPC workloads are an order of magnitude heavier than networking ones and so we have a free ride there.

dpino commented 8 years ago

With regard to the performance problems we had in the lwaftr implementation due to packet copies: besides packet copying, another reason for the bad performance was having to allocate/deallocate several lib.protocol objects per packet operation. The code refactoring, which used ethernet, ipv4 and ipv6 struct headers directly instead, did two things: 1) optimized packet copying by avoiding unnecessary operations, as mentioned, and 2) avoided excessive allocation/deallocation of lib.protocol.{eth,ipv4,ipv6} objects per packet.

I observed that, for instance, in the implementation of VPWS (apps/vpn/vpws.lua) there are several lib.protocol objects (ipv6, gre, ethernet) instantiated in the constructor. The values of these objects never change; they are reused and copied into a new datagram per push operation. In our case, however, the data these headers contain varies per push operation, so we needed to create new ethernet, ipv4 and ipv6 lib.protocol objects per push.

Here's a diff of one of the code refactorizations: https://github.com/Igalia/snabbswitch/commit/3a357b2365ce24c900d6800888ea8e9fbe427f74

The add_ipv6_header and add_ethernet_header functions, removed in this commit, instantiated new lib.protocol.ipv6 and lib.protocol.ethernet objects respectively.

alexandergall commented 8 years ago

Creating new objects in the processing loop is definitely a no-go. One way to avoid actual allocations is to call the object's free() method before exiting push(), which will put the instance on a free list. The next call of the constructor returns an object from that list, which will keep the code free of garbage. The list manipulations still introduce some processing overhead, of course.

PR #637 extends the framework to allow immediate "recycling" of an object instance by promoting the standard constructors (new, new_from_mem) to instance methods. Calling new on an existing object is equivalent to calling free followed by a call to the new class method but avoids going through the free list.

The upshot is that you can pre-allocate a protocol object and use it in the push loop with fairly low overhead.

Here's an example from my current tunneling code: https://github.com/alexandergall/snabbswitch/blob/vpn/src/program/l2vpn/tunnels/l2tpv3.lua#L40 Also note that I don't use parse() and pop() for decapsulation but pop_raw for efficiency, because in this case I already know the type of header.

dpino commented 8 years ago

FWIW, our previous code followed that principle, using free() on the lib.protocol objects after push. Here's an example: https://github.com/Igalia/snabbswitch/blob/3a357b2365ce24c900d6800888ea8e9fbe427f74/src/apps/lwaftr/lwaftr.lua#L87. However, performance improved when we skipped object creation and used the struct headers to cast and manipulate packet data directly. Maybe we were doing something wrong, dunno.

With regard to pop_raw, it's what we used for decapsulation too.

alexandergall commented 8 years ago

The new method (from PR #637) should perform better than free(). Maybe you want to give that a try. I think my app went from ~1Mpps to ~6Mpps with this method and some other optimisations (like pop_raw vs parse/pop). If you want ultimate speed, nothing beats twiddling the bits directly, of course :)

dpino commented 8 years ago

Yes, sure. We would love to give the lib.protocol classes a try after the overhaul. I prefer using a proper class to manipulating the packets directly, and it's better for the whole project itself. 1Mpps to 6Mpps is a really good improvement :)

lukego commented 8 years ago

Just noodling around: I sketched some assembler code for how a special-cased memcpy for packet.shiftleft() might look. This is not compiled or tested, let alone benchmarked. Here it is anyway:

-- API: void packet_copy_backward_64(void* dst, void* src, int size):
--
--   Copy data backward in memory in 64-byte blocks.
--   Size will be rounded up to a multiple of 64.
--   Intended for packet-sized data.
-- 
-- The block size of 64 bytes is intended to be reasonable for all the
-- current/next/previous generations x86 SIMD (AVX, AVX2, AVX512).

-- Haswell implementation based on AVX2 with 32-byte registers:
local function asm_haswell_packet_copy_backward_64 (Dst)
   |  -- Setup registers like this:
   |  --   rcx = 0        (counter)
   |  --   rdx = bytes    (limit for counter)
   |  --   rdi = dst      (base destination address)
   |  --   rsi = src      (base source address)
   |  -- (x86-64 calling convention does most of the work in this case.)
   |  xor rcx, rcx
   |
   |  -- 64-byte copy loop:
   |1:
   |  -- copy first 32 bytes
   |  vmovdqu ymm0, [rsi+rcx]
   |  vmovdqu [rdi+rcx], ymm0
   |  -- copy second 32 bytes
   |  vmovdqu ymm0, [rsi+rcx+32]
   |  vmovdqu [rdi+rcx+32], ymm0
   |  -- test loop exit condition
   |  add rcx, 64
   |  cmp rcx, rdx
   |  jb <1
   |  ret
end

Question: How should we evaluate a function like this to decide whether it is good/bad/neutral?
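
One possible answer, sketched in C with a stand-in for the DynASM routine (a hypothetical harness, not existing code): first check the contract against memcpy/memmove over many size and shift combinations, then count cycles on cache-hot data and compare against plain memcpy / ffi.copy() on the same workload.

/* Hypothetical evaluation harness; copy_backward_64() is only a C stand-in
 * with the same contract as the DynASM sketch (copy in 64-byte blocks, size
 * rounded up, dst below src). Build e.g.: gcc -O2 harness.c -o harness */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

static void copy_backward_64(uint8_t *dst, const uint8_t *src, int size)
{
    for (int i = 0; i < size; i += 64)
        memmove(dst + i, src + i, 64);   /* overlap-safe within each block */
}

int main(void)
{
    static uint8_t buf[4096], ref[4096];
    for (int i = 0; i < 4096; i++) buf[i] = ref[i] = (uint8_t)i;

    /* Correctness: shift left by various header-like amounts and sizes. */
    for (int shift = 1; shift <= 64; shift++) {
        for (int size = 64; size <= 2048; size += 64) {
            memcpy(buf, ref, sizeof(buf));
            copy_backward_64(buf, buf + shift, size);
            assert(memcmp(buf, ref + shift, size) == 0);
        }
    }

    /* Timing: cycles per 576-byte (550 rounded up) shift on cache-hot data. */
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 1000000; i++)
        copy_backward_64(buf, buf + 14, 576);
    uint64_t t1 = __rdtsc();
    printf("%.1f cycles per copy\n", (double)(t1 - t0) / 1000000);
    return 0;
}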

gonzopancho commented 8 years ago

netmap uses this, which is similar.

/*
 * this is a slightly optimized copy routine which rounds
 * to multiple of 64 bytes and is often faster than dealing
 * with other odd sizes. We assume there is enough room
 * in the source and destination buffers.
 *
 * XXX only for multiples of 64 bytes, non overlapped.
 */
static inline void
pkt_copy(void *_src, void *_dst, int l)
{
        uint64_t *src = _src;
        uint64_t *dst = _dst;
        if (unlikely(l >= 1024)) {
                memcpy(dst, src, l);
                return;
        }
        for (; likely(l > 0); l-=64) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}

lukego commented 8 years ago

@gonzopancho Thanks for sharing! I wonder how they decided to use this instead of memcpy?

Gotta say that I find it really hard to guess the effectiveness of this code. So many variables that depend on compiler and CPU behavior that may be obvious to some people but not to me:

  1. If that compiles into 8-byte load/store operations (movq) then what throughput/cycle will Haswell achieve? (Just 8 bytes/cycle or will the processor combine the operations and get full 32 bytes/cycle?)
  2. Can GCC vectorize those moves as written? What if you changed the inner loop to memcpy(dst, src, 64), which the compiler treats as a builtin/intrinsic (small constant-size copy)? If so, would you expect people to compile with -march=native to enable optimizations specific to their CPU microarchitecture (e.g. Haswell)? (A sketch of that variant is below, after this comment.)
  3. How much can you depend on compiler behavior? (Got to worry about consistent results for various versions of GCC, CLANG, and ICC?)

I know that an expert C system programmer will probably know the answer to all of these questions but it makes my head hurt :). LuaJIT optimization also makes my head hurt of course...
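
To make question 2 concrete, here is a sketch (not netmap code) of the same inner loop expressed as a constant-size memcpy; GCC and clang expand small constant-size memcpy calls inline, and with -march=native on Haswell they can use 32-byte vmovdqu moves for it.

#include <string.h>

/* Hypothetical variant of pkt_copy's inner loop: let the compiler pick the
 * widest moves it knows for the target instead of spelling out uint64_t. */
static inline void
pkt_copy_blocks(void *_src, void *_dst, int l)
{
        char *src = _src;
        char *dst = _dst;
        for (; l > 0; l -= 64, src += 64, dst += 64)
                memcpy(dst, src, 64);   /* constant size: expanded inline, no call */
}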

lukego commented 8 years ago

Just another passing thought re: testing:

For test purposes it would be really interesting to have some standardized distributions of packet sizes for the various averages that we want to test.

For example in practice I would say that a 500-byte mean packet size is quite representative of consumer internet traffic, but it is likely to be a bimodal distribution with many large packets (1500 bytes, carrying e.g. HTTP/TCP payload) and many small packets (ACKs, DNS req/res, etc). This could be significant. Looking at the excerpted netmap code, it uses a completely different memcpy routine for packets over 1024 bytes, and a naive benchmark setup could miss that code path completely (and such separate code paths can also exist internally in e.g. GLIBC memcpy).
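
As a strawman for such a standardized distribution, a benchmark could draw sizes from a simple bimodal mix instead of using one fixed size. The numbers below are only illustrative (not a measured traffic profile): roughly 30% 1500-byte and 70% 64-byte packets gives a mean around 500 bytes.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bimodal packet-size generator for benchmark workloads. */
static int next_packet_size(void)
{
    /* p * 1500 + (1 - p) * 64 = 500  =>  p ~= 0.304 */
    double p = (double)rand() / RAND_MAX;
    return (p < 0.304) ? 1500 : 64;
}

int main(void)
{
    long total = 0, n = 1000000;
    srand(42);
    for (long i = 0; i < n; i++) total += next_packet_size();
    printf("mean packet size: %.1f bytes\n", (double)total / n);
    return 0;
}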

In the past working as a vendor we have always had some test vectors based on real traffic from the networks we are targeting that we have obtained with permission. I am sure there are research projects that have taken representative traces and anonymized them and made them available to share freely. Anybody have a reference?

gonzopancho commented 8 years ago

They probably measured it (actually, I know they did) and found it faster than the bcopy()/memcpy() in the freebsd and linux kernels.

Using clang on freebsd 11:

jim@x230:~ % clang -v
FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906
Target: x86_64-unknown-freebsd11.0
Thread model: posix

and cc -S, the inner loop that is generated is thus:

.LBB0_3:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
    movq    (%rax), %rcx
    movq    %rcx, (%rsi)
    movq    8(%rax), %rcx
    movq    %rcx, 8(%rsi)
    movq    16(%rax), %rcx
    movq    %rcx, 16(%rsi)
    movq    24(%rax), %rcx
    movq    %rcx, 24(%rsi)
    movq    32(%rax), %rcx
    movq    %rcx, 32(%rsi)
    movq    40(%rax), %rcx
    movq    %rcx, 40(%rsi)
    movq    48(%rax), %rcx
    movq    %rcx, 48(%rsi)
    movq    56(%rax), %rcx
    movq    %rcx, 56(%rsi)
    addl    $-64, %edx
    addq    $64, %rax
    addq    $64, %rsi
    cmpl    $64, %edx
    jg  .LBB0_3

compiling with -O3 (clang has -O4 == -O3) and using gdb to disassemble:

(gdb) disassemble pkt_copy 
Dump of assembler code for function pkt_copy:
0x0000000000000000 <pkt_copy+0>:    push   %rbp
0x0000000000000001 <pkt_copy+1>:    mov    %rsp,%rbp
0x0000000000000004 <pkt_copy+4>:    mov    %rdi,%rax
0x0000000000000007 <pkt_copy+7>:    cmp    $0x3ff,%edx
0x000000000000000d <pkt_copy+13>:   jg     0x70 <pkt_copy+112>
0x000000000000000f <pkt_copy+15>:   test   %edx,%edx
0x0000000000000011 <pkt_copy+17>:   jle    0x6e <pkt_copy+110>
0x0000000000000013 <pkt_copy+19>:   add    $0x40,%edx
0x0000000000000016 <pkt_copy+22>:   nopw   %cs:0x0(%rax,%rax,1)
0x0000000000000020 <pkt_copy+32>:   mov    (%rax),%rcx
0x0000000000000023 <pkt_copy+35>:   mov    %rcx,(%rsi)
0x0000000000000026 <pkt_copy+38>:   mov    0x8(%rax),%rcx
0x000000000000002a <pkt_copy+42>:   mov    %rcx,0x8(%rsi)
0x000000000000002e <pkt_copy+46>:   mov    0x10(%rax),%rcx
0x0000000000000032 <pkt_copy+50>:   mov    %rcx,0x10(%rsi)
0x0000000000000036 <pkt_copy+54>:   mov    0x18(%rax),%rcx
0x000000000000003a <pkt_copy+58>:   mov    %rcx,0x18(%rsi)
0x000000000000003e <pkt_copy+62>:   mov    0x20(%rax),%rcx
0x0000000000000042 <pkt_copy+66>:   mov    %rcx,0x20(%rsi)
0x0000000000000046 <pkt_copy+70>:   mov    0x28(%rax),%rcx
0x000000000000004a <pkt_copy+74>:   mov    %rcx,0x28(%rsi)
0x000000000000004e <pkt_copy+78>:   mov    0x30(%rax),%rcx
0x0000000000000052 <pkt_copy+82>:   mov    %rcx,0x30(%rsi)
0x0000000000000056 <pkt_copy+86>:   mov    0x38(%rax),%rcx
0x000000000000005a <pkt_copy+90>:   mov    %rcx,0x38(%rsi)
0x000000000000005e <pkt_copy+94>:   add    $0xffffffffffffffc0,%edx
0x0000000000000061 <pkt_copy+97>:   add    $0x40,%rax
0x0000000000000065 <pkt_copy+101>:  add    $0x40,%rsi
0x0000000000000069 <pkt_copy+105>:  cmp    $0x40,%edx
0x000000000000006c <pkt_copy+108>:  jg     0x20 <pkt_copy+32>
0x000000000000006e <pkt_copy+110>:  pop    %rbp
0x000000000000006f <pkt_copy+111>:  retq   
0x0000000000000070 <pkt_copy+112>:  movslq %edx,%rdx
0x0000000000000073 <pkt_copy+115>:  mov    %rsi,%rdi
0x0000000000000076 <pkt_copy+118>:  mov    %rax,%rsi
0x0000000000000079 <pkt_copy+121>:  pop    %rbp
0x000000000000007a <pkt_copy+122>:  jmpq   0x7f
End of assembler dump.

Given that RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8-R15 are the labels of the 16 64-bit general-purpose registers (not SIMD registers), I think we have an answer: clang does not vectorize this loop.

And it's not dissimilar to the fairly modern code in glibc: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/lib/memcpy_64.S?id=HEAD

the body of the 'forward' loop is

.Lcopy_forward_loop:
    subq $0x20, %rdx

    /*
     * Move in blocks of 4x8 bytes:
     */
    movq 0*8(%rsi), %r8
    movq 1*8(%rsi), %r9
    movq 2*8(%rsi), %r10
    movq 3*8(%rsi), %r11
    leaq 4*8(%rsi), %rsi

    movq %r8,   0*8(%rdi)
    movq %r9,   1*8(%rdi)
    movq %r10,  2*8(%rdi)
    movq %r11,  3*8(%rdi)
    leaq 4*8(%rdi), %rdi
    jae  .Lcopy_forward_loop
    addl $0x20, %edx
    jmp  .Lhandle_tail

If you assume AVX, then glibc has already done the work for you

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S;h=9f033f54568c3e5b6d9de9b3ba75f5be41070b92;hb=HEAD

dpdk has some good work: http://dpdk.org/browse/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_memcpy.h and a test harness: http://dpdk.org/browse/dpdk/tree/app/test/test_memcpy_perf.c

This is ancient history, but when I was at Convex (1986-ish?), it was trivial to write a short 'C' program that would saturate the memory bus. A couple 'for' loops would do it. Of course, Convex had a compiler that would vectorize the loops...

Netmap, of course, has a lot of architecture to avoid copies. Avoided copies are the fastest copies.

lukego commented 8 years ago

@gonzopancho Thank you for taking the time to follow different projects and share ideas between them. I am really glad that so many projects are able to explore different kinds of networking software designs (Linux, BSD, netmap, DPDK, Snabb Switch, ...). This seems much more productive than everybody doing things the same way, at least if we are able to get the benefit of each others' experiments and experiences.

I reckon we need to expand our performance regression test suite to be able to resolve all questions about the relative merits of this sort of code. Then we merge all changes that simplify the code while preserving performance and we merge only those optimizations that demonstrably improve performance without excessive complexity. I have some new ideas on this that I will write up separately.

gonzopancho commented 8 years ago

Cool. Your "merge all changes while preserving performance" is literally the theme of the paper that George Neville-Neil and I did: "Measure twice, code once". Of course, preserving performance starts with observing performance.

https://github.com/gvnn3/netperf https://github.com/gvnn3/conductor

I think it's likely that the finding will be that the copy is cheap, but its effect on the cache (pollution of same) may be quite expensive.

lukego commented 8 years ago

@gonzopancho Thanks for the links! cc @eugeneia.

We are also investing a bunch of effort in end-to-end performance tests on our CI e.g. booting VMs and driving traffic between them and a physical network. These repeatable benchmarks have allowed us to retire a boatload of complex optimizations already. I love those kind of activities :).

I think it's likely that the finding will be that the copy is cheap, but its effect on the cache (pollution of same) may be quite expensive.

This would be a valuable and actionable result. The copies that are controversial in our code base right now have their source and destination addresses in the same cache lines. They are moving data in-place while modifying packets (e.g. adding and removing encapsulations).

This kind of copy can simplify our code base, in particular by allowing us to represent packets in a canonical form, but it can also create complexity elsewhere if people feel the need to code around these operations because they are not fast enough. If that happens then we need to either optimize them more or create a new mechanism, e.g. adding an indirection so that the start address of a packet can be modified without touching the data.
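
For the record, the indirection idea could look roughly like this (a sketch, not the actual Snabb struct packet; sizes and names are arbitrary): keep headroom in front of the data and adjust an offset instead of moving bytes when a header is added or removed.

#include <assert.h>
#include <stdint.h>

/* Hypothetical packet layout with headroom for header manipulation. */
struct pkt {
    uint16_t offset;           /* start of packet data within payload[]    */
    uint16_t length;           /* bytes of packet data starting at offset  */
    uint8_t  payload[10240];   /* payload[0 .. offset-1] is free headroom  */
};

static inline uint8_t *pkt_data(struct pkt *p) { return p->payload + p->offset; }

/* Prepend a header by moving the start of data backward: no copy at all. */
static inline uint8_t *pkt_prepend(struct pkt *p, uint16_t bytes)
{
    assert(p->offset >= bytes);
    p->offset -= bytes;
    p->length += bytes;
    return pkt_data(p);
}

/* Drop a header ("shiftleft") by moving the start of data forward. */
static inline void pkt_shiftleft(struct pkt *p, uint16_t bytes)
{
    assert(p->length >= bytes);
    p->offset += bytes;
    p->length -= bytes;
}

The trade-off is that every consumer of packet data has to go through the offset, so the canonical "data starts at a fixed place" property is lost, which is exactly the simplification the in-place copies were buying us.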

fozog commented 7 years ago

Hello!

In the case of packet copies, the instructions are not that important. The internal architecture of the processor has Execution Units. Among those, you have load/store "ports" that limit the number of parallel requests to memory: http://www.anandtech.com/show/6355/intels-haswell-architecture/8 http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5

For Skylake, as far as I know, you can only do 2 memory operations in parallel, not 3, and this is valid for the whole core, not per thread. The net result is that when you copy all packets at the 148 Mpps line rate of a 100 Gbps link, hyperthreading is useless; worse, you lose the "turbo" benefits.

But the execution units are not the only limit. Cache is a precious resource, and copying halves the cache available to the active dataset. Then you have to ensure proper positioning, because there are also other issues related to cacheline placement in an Intel processor. A hash is calculated and a cacheline is stored on the relevant CBo (Cachebox: http://images.anandtech.com/doci/8423/HaswellEP_DieConfig.png?_ga=1.10720884.1957720941.1472434760). So the CBo itself is a limitation. In addition, you reach the CBo by moving the cacheline in store-and-forward mode along the dual ring(s) of the processor. In large core counts (>16) there are two dual rings interconnected, so storing a cacheline may further incur other costs from crossing the dual-ring bridge. I made measurements 4 years ago (may be very different now) and packet copy latency and throughput can be impacted by as much as 30% depending on the core you use and the CBo set that happens to be mostly touched...

Lastly, cacheline activity may incur eviction, which also touches the limits of the Integrated Memory Controllers... There is a queue depth of 10 requests to memory per channel, so the theoretical limit would be 16 cores doing 2 memory requests in parallel on a 4-channel system with perfectly balanced memory locations. But given the DIMM timings to answer and the activity of the prefetcher, you cannot have that many parallel memory transactions. I never made scalability measurements for memory bandwidth and latency of this type, so I don't have data to share.

Another angle would be to use non-temporal hints to leverage "write buffers" instead of cachelines, but those are also a scarce resource of the processor.
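
For illustration, the non-temporal variant of the 64-byte copy could look like this (a sketch with made-up names, assuming AVX and a 32-byte-aligned destination). Whether it helps depends entirely on whether the copied data is read again soon, since non-temporal stores push the data toward memory instead of into the cache.

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical non-temporal 64-byte copy: dst must be 32-byte aligned. */
static inline void copy64_nontemporal(void *dst, const void *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 32));
    _mm256_stream_si256((__m256i *)dst, lo);                /* vmovntdq: bypasses cache */
    _mm256_stream_si256((__m256i *)((uint8_t *)dst + 32), hi);
}

/* After a batch of non-temporal stores, order them with a store fence. */
static inline void copy_flush(void) { _mm_sfence(); }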

Bottom line: for evaluating the impact of packet copies, instruction count is the tip of the iceberg (you may not even know whether REP MOVSB translates into more efficient micro-operations than your two AVX512 instructions).

For high performance networking, the memory subsystem is key. The early generation of Xeon E5-4600 had a very bad "Home Agent" that routed memory requests along a uni-directional ring among the 4 NUMA nodes, so the cost to access remote memory was incredibly high. So to assess the impact of packet copy, you must figure out all the details of the life of a cacheline beyond the instructions themselves. And you can't stay at a high level and picture a 20MB L3 cache anymore: you have independent CBos interconnected by a dual ring...