
Snabb data structures: packets, links, and apps #11

lukego opened 8 years ago

lukego commented 8 years ago

Software architectures can sometimes be summarized with a few key data structures.

Unix is about processes, pipes, and files. Processes are executing code, pipes are FIFO byte buffers, and files are binary storage.

Emacs is about text, buffers, and windows. Text is strings of characters with key-value properties, buffers are collections of text and positional markers, and windows are user-visible screen areas that display parts of buffers.

Snabb Switch is about packets, links, and apps.

(This post follows on from #10 Snabb Switch In a Nutshell and assumes that you already have the "app network" picture in your head.)

Packets

Packets are the basic inputs and outputs of Snabb Switch. A packet is simply a variable-size array of binary data. Packets usually contain data in an Ethernet-based format but this is only a convention.

struct packet {
  unsigned char payload[10240]; // packet data
  uint16_t length;              // number of payload bytes in use
};

Packets on the wire in physical networks are bits encoded as a series of electrical or optical impulses. Snabb Switch just encodes those same bits into memory. (We resist the temptation of more complicated representations.) Code references: packet.h and packet.lua.
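
To make the layout concrete, here is a small LuaJIT sketch that declares the struct above with the FFI and fills in a few payload bytes. This is purely illustrative: the real definitions and allocation routines live in packet.h and packet.lua.

-- Declare the packet struct from above via the LuaJIT FFI (illustration only).
local ffi = require("ffi")

ffi.cdef[[
struct packet {
  unsigned char payload[10240];
  uint16_t length;
};
]]

-- Allocate one packet and copy a few bytes into it.
local p = ffi.new("struct packet")
local data = "abcd"
ffi.copy(p.payload, data, #data)
p.length = #data
print(p.length) --> 4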

Links

A link collects a series of packets for processing by an app. Links between apps serve a similar purpose to Ethernet cables between network devices, except that links are unidirectional. Links are represented as simple ring buffers of packets.

struct link {
  struct packet *packets[256]; // ring buffer of packet pointers
  int read, write;             // ring cursor positions
};

Actually this is a slightly idealized view of the link. The real link struct also includes some counters that are incremented when packets are added, removed, or dropped because no space is available. I suspect we will transition to this simpler representation over time and that is why I show it here. Code refs: link.h and link.lua.
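
To show how the ring cursors work, here is a minimal Lua sketch of transmit and receive operations on the simplified link above, with the link modelled as a plain Lua table. The function names and the single drop counter are made up for this example; the real operations and counters are in link.h and link.lua. Because the ring size is a power of two, the cursors can simply be masked when they are incremented.

local band = require("bit").band
local SIZE = 256 -- power of two, so cursors can wrap with a bitwise mask

local function new_link ()
  return { packets = {}, read = 0, write = 0, txdrop = 0 }
end

local function empty (l) return l.read == l.write end
local function full (l)  return band(l.write + 1, SIZE - 1) == l.read end

local function transmit (l, p)
  if full(l) then
    l.txdrop = l.txdrop + 1                 -- no space: count the dropped packet
  else
    l.packets[l.write] = p
    l.write = band(l.write + 1, SIZE - 1)   -- mask on increment, never overflows
  end
end

local function receive (l)
  assert(not empty(l), "attempt to receive from an empty link")
  local p = l.packets[l.read]
  l.read = band(l.read + 1, SIZE - 1)
  return p
end

-- Example: a packet makes a round trip through a fresh link.
local l = new_link()
transmit(l, "a packet")
assert(receive(l) == "a packet")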

Apps

Apps are the active part of Snabb Switch. Each app performs either or both of these functions:

  1. "Pull" new packets into Snabb Switch by receiving data from the outside world (e.g. a network interface card) and placing them onto output links for processing.
  2. "Push" existing packets from input links through the next step of their processing: output onto a real network, transfer onto one or more output links for processing by other apps, perform filtering or transformation, and so on.

In principle an app is a piece of machine code: anything that can execute. In practice an app is represented as a Lua object and executes code compiled by LuaJIT. (This code can easily call out to C, assembler, or other languages but in practice it seldom does.)

{
  input  = { ... },     -- Table of named input links
  output = { ... },     -- Table of named output links
  pull   = <function>,  -- Function to "pull" new packets into the system.
  push   = <function>   -- Function to "push" existing packets onward.
}

Code reference: simple examples in basic_apps.
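
As a toy example of the shape above, here is a pass-through app whose push function moves every packet from its input link named "input" to its output link named "output". It reuses the empty, full, transmit, and receive helpers from the link sketch in the previous section (assume they are in scope); it is not one of the real basic_apps and does not use the actual Snabb libraries.

-- A toy pass-through app (illustration only, not taken from basic_apps).
local Forward = {
  input  = {},   -- named input links, filled in when the app network is configured
  output = {},   -- named output links, likewise
}

function Forward.push ()
  local i, o = Forward.input.input, Forward.output.output
  while not empty(i) and not full(o) do
    transmit(o, receive(i))
  end
end

A pull-style app would have the same shape but implement pull instead, creating fresh packets and transmitting them onto its output links.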

Summary

Those are the most important data structures in Snabb Switch. To do serious Snabb Switch development you only need to write some code that manipulates packets and links. Usually we write apps in Lua using some common libraries, but you can realistically write them from scratch in Lua, C, assembler, or anything else you care to link in.

There are more details of course: we will dig into those later.

pkhuong commented 8 years ago

Do you consistently put metadata at the end of structs? For the packet struct, ISTM we can expect a lot of sequential processing from the first to the last byte. In that case, it would make sense to put the length first… unless there are alignment issues with DMA? Even then, I'd put the length first + padding, unless you can guarantee huge pages.

I'd be wary of int for ring buffers: signed overflow is undefined.

lukego commented 8 years ago

> I'd put the length first

Care to send that as a PR? The CI will automatically run benchmarks and it would be really neat if those are adequate to evaluate such a micro-optimization.

> I'd be wary of int for ring buffers: signed overflow is undefined.

Good point! The code we have today doesn't let the values overflow (mask on increment) and I have tested on up to a trillion packets.

I did have to check the simplified ring that I used in the "write an app in pure assembler" experiment (SnabbCo/snabbswitch#603) and luckily that does use unsigned (and mask-on-access).

lukego commented 8 years ago

@pkhuong Going meta: I observe that it is fun and easy to come up with ideas for micro-optimizations, both for the CPU and for the JIT, and we really need to make it equally fun and easy to test and verify them. Over time we are building up the performance tests in our CI to cover real end-to-end applications. I hope to reach the point where we can simply dash off each optimization idea in a PR and automatically find out whether it is good or not.

I am avoiding merging micro-optimizations when we don't have a suitable test environment to verify that they are effective. I feel like the issues in play are just too subtle to talk about meaningfully without both strong mental models and good empirical tests. (Thinking of comp.arch sort of discussions where arbitrarily impressive arguments can be made both for and against a given change for a given CPU architecture and you would never find out which one was right :-))

Having said that, the efforts to test and verify often come when somebody is stretching to meet a performance goal and is more than happy to test ideas that have been kicked around in the past. cc @kbara :).

tokenrove commented 8 years ago

Automated performance regression testing is of great interest to me; would you mind blogging at some point about how you're doing it on the Snabb Switch project?

kbara commented 8 years ago

I haven't had a chance to look at it closely, but I know @DRMacIver has been doing some interesting automated performance regression tests on Hypothesis.

Re: @lukego , I tend to agree about how subtle many of these things are; empirical testing in one context can tell you surprisingly little! That said, some empirical validation (or refutation) is usually better than none. :)

DRMacIver commented 8 years ago

I wouldn't go as far as to say it was interesting. There's a big line in my benchmarking script that says "Is this statistically sound? ¯\_(ツ)_/¯". :-)

If you're interested, here's the comparison script: https://github.com/DRMacIver/hypothesis/blob/benchmarking/scripts/benchcompare.py

And it's used here: https://github.com/DRMacIver/hypothesis/blob/benchmarking/scripts/runbenchmark.sh

It's not really good enough to detect micro-optimisations (you'd need a lot more data points for starters). It's mostly designed to make sure I haven't accidentally introduced any major regressions since the last released version.

lukego commented 8 years ago

@kbara @pkhuong Sure is wonderful when an optimization has a satisfying explanation and a repeatable empirical test. This should get easier for us over time as our profiling tools improve and our stable of standard benchmarks expands to cover more interesting cases.

The idea of moving the length field to the start of the packet is an interesting one. If this is effective I would hope that we could see an improvement on the CI NFV DPDK benchmark (blasting packets from a NIC through a VM) and/or the synthetic basic1 benchmark (feeding packets through a simple app network). Those are CPU-bound benchmarks that touch packet structs at the rate of ~ 10-100 million times per second.

I am also really keen on improving the profiling and PMU support. For example it would be nice to have fine-grained regression tests on things like the number of L2/L3 cache accesses per packet for various apps. These could be used to test ideas for microarchitecture-dependent optimizations e.g. the idea that by moving the length field to before the payload we would eliminate one cache miss per packet. I am actively working on building up the PMU profiling support for this purpose - see SnabbCo/snabbswitch#616.

I also want to add support for the PEBS feature of the PMU. Then the CPU can sample the addresses of cache misses and log them into a memory buffer. We should then be able to analyze those addresses and characterize the rate of cache misses for e.g. "packet.length", "Nth cache line of packet.data", "NIC DMA descriptor", "QEMU guest memory", "Lua data", "FFI data", etc. So far it is working out quite well in traceprof (SnabbCo/snabbswitch#623) to collect pretty raw samples and then analyze them in-process with access to all program state.

Alignment is an interesting topic. DMA requirements tend not to be so strict: on an Intel NIC IIRC any even-numbered address is OK. SIMD is another story though: many apps have their bottlenecks in SIMD operations like memcpy, memmove, or IP checksum and we may be able to exploit alignment here. It will be interesting to geek out deeply on this in the future. Could be that microarchitectural details come into conflict e.g. the desire to share a cache line between packet metadata and actual data vs. the desire to align the data on a 64-byte boundary for AVX512. Gonna be fun... and gonna need some great tools to do this properly and avoid the risk of just spinning our wheels.

lukego commented 8 years ago

@tokenrove The master of the CI performance tests is @eugeneia, you should lobby him to start a blog :). He is actually doing major surgery on this now in SnabbCo/snabbswitch#626.

The basic idea is to have a collection of benchmarks that actually represent stuff that matters (is valuable to improve, is not allowed to regress) and automatically compare each PR to the master branch. See example results at the top of any CI test result on the repo e.g. https://gist.github.com/SnabbBot/1958aae01fe2be9897a6

One challenge for us is that realistic tests are often fairly involved: start a load generator on a 10G NIC, start a Snabb application with a particular configuration on another NIC that is directly cabled, start a virtual machine in a particular QEMU version and running a particular OS, and then see how fast packets can be moved through the whole chain. Ideally we would like to test quite a few permutations of such setups. The solution we are testing, which so far looks very promising, is to use Docker to capture this whole environment in a container that can easily be deployed to new machines.

eugeneia commented 8 years ago

@tokenrove I hear you! Will let you know once I do a write up. :)

tokenrove commented 8 years ago

@eugeneia @lukego Great! I find it's really hard to do this without either lots of intermittent failures or enough wiggle room that small regressions aren't noticed until they build up.

What I'm really interested in is a system like the one proposed in Mostafa and Krintz's paper "Tracking Performance Across Software Revisions" or Kalbarczyk et al.'s "Analyzing Performance Differences between Multiple Code Versions" (and related literature), but they seem like massive overkill for most projects, where hand-written benchmarks can serve the same purpose.

@DRMacIver Thanks, it's always useful to see what people are using in practice!

eugeneia commented 8 years ago

@tokenrove @lukego I finally wrote about Continuous Integration for Snabb Switch on my blog. :wink: Is it telling?

tokenrove commented 8 years ago

@eugeneia Thanks!

lukego commented 8 years ago

@eugeneia Awesome!!