snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

100 Gbps multiplex/demultiplex #691

Open lukego opened 8 years ago

lukego commented 8 years ago

What?

@xrme and I have been talking about the problem of dispatching a 100 Gbps traffic stream with one CPU core. This would be necessary to implement the "100G with software dispatching" multiprocessor design sketched in #685. What we need is a general N:M multiplexer-demultiplexer app that can split and merge packets between many links at very high speeds.

This is essentially a software implementation of NIC hardware features like RSS and VMDq, which inspect a packet before deciding which queue to assign it to, for example by looking up its destination MAC in a small table or by hashing its destination IP address.
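As a rough illustration of those two policies, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `MAC_TABLE`, `queue_by_mac`, `queue_by_dst_ip` and `make_frame` are hypothetical, the frames are untagged Ethernet carrying IPv4 without options, and CRC32 stands in for whatever hash a real implementation would use (hardware RSS typically uses a Toeplitz hash).

```python
import ipaddress
import zlib

NUM_QUEUES = 4  # hypothetical number of output links

# Hypothetical VMDq-style table: destination MAC -> queue index.
MAC_TABLE = {
    bytes.fromhex("020000000001"): 0,
    bytes.fromhex("020000000002"): 1,
}


def queue_by_mac(frame: bytes, default: int = 0) -> int:
    """VMDq-like policy: look up the destination MAC (first 6 bytes)."""
    return MAC_TABLE.get(frame[0:6], default)


def queue_by_dst_ip(frame: bytes) -> int:
    """RSS-like policy: hash the IPv4 destination address.

    Assumes a 14-byte Ethernet header followed by an IPv4 header, so the
    destination address sits at bytes 30..33. CRC32 is only a stand-in hash.
    """
    dst_ip = frame[30:34]
    return zlib.crc32(dst_ip) % NUM_QUEUES


def make_frame(dst_mac: bytes, dst_ip: str) -> bytes:
    """Build a minimal test frame: Ethernet header plus a bare IPv4 header."""
    eth = dst_mac + bytes(6) + b"\x08\x00"  # dst MAC, zero src MAC, EtherType IPv4
    ip = bytearray(20)
    ip[0] = 0x45  # version 4, header length 5 words
    ip[16:20] = ipaddress.IPv4Address(dst_ip).packed
    return eth + bytes(ip)


frame = make_frame(bytes.fromhex("020000000002"), "10.0.0.7")
print(queue_by_mac(frame), queue_by_dst_ip(frame))
```

Both policies touch only the first few dozen bytes of the packet, which is why the cost of loading that data matters so much for the cycle budget.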

Why?

There would be advantages to being able to do this in software instead of relying on hardware:

  1. Extensible to new protocols (NICs support very few).
  2. Create a uniform interface that can aggregate links (1x100G, 10x10G, 2x40G, etc).
  3. Support many NICs without being limited to "lowest common denominator" features.
  4. Software could be used as a fallback mechanism when hardware support is not available due to NIC or driver limitation.

If the software implementation was really excellent then it may even be practical to drop features from our drivers and simplify our hardware integration.

How?

The challenge is to achieve practical performance. For example, to dispatch 50 million packets per second on a 2.4 GHz CPU the processing budget would be 48 cycles per packet. The work that would need to be done in those cycles is approximately:

  1. Get packet from input source e.g. NIC hardware RX queue.
  2. Load relevant packet payload from L3 cache (DMA'd) into CPU registers.
  3. Choose the right output link with hash/lookup.
  4. Queue packet on an inter-process output link.
  5. Retrieve free buffers and refill the hardware RX queue.
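The five steps above can be sketched as a single loop. This is a Python model for reading only, not the assembler implementation the issue proposes: `dispatch_burst`, `rx_ring`, `rx_posted`, `freelist` and the `choose_link` callback are all hypothetical names, and plain deques and lists stand in for the hardware RX queue and the inter-process links.

```python
from collections import deque


def dispatch_burst(rx_ring, rx_posted, links, freelist, ring_size, choose_link):
    """Run the five steps once over everything currently in rx_ring.

    rx_ring     -- deque of received packets (models the hardware RX queue)
    rx_posted   -- deque of empty buffers handed back to the NIC
    links       -- list of deques, one per output link
    freelist    -- list of free buffers
    ring_size   -- how many empty buffers the NIC should have posted
    choose_link -- hypothetical callback: packet -> index into links
    Returns the number of packets dispatched.
    """
    dispatched = 0
    while rx_ring:
        packet = rx_ring.popleft()        # 1. get packet from the input source
        index = choose_link(packet)       # 2+3. read packet bytes, pick a link
        links[index].append(packet)       # 4. queue packet on the output link
        dispatched += 1
    while freelist and len(rx_posted) < ring_size:
        rx_posted.append(freelist.pop())  # 5. refill the RX queue with buffers
    return dispatched


links = [deque() for _ in range(4)]
rx_ring = deque(bytes([i]) * 34 for i in range(8))  # eight dummy 34-byte packets
rx_posted = deque()
freelist = [bytearray(64) for _ in range(16)]

count = dispatch_burst(rx_ring, rx_posted, links, freelist, ring_size=4,
                       choose_link=lambda pkt: pkt[33] % len(links))
print(count, [len(link) for link in links])
```

Passing the classifier in as a callback keeps the loop independent of the dispatch policy. Whether that kind of indirection is affordable within the budget is exactly the open question.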

Can we do this work within the cycle budget? The plan is to try and see. Since we are looking for the absolute limit of what the processor can achieve it seems to make sense to start by building up the parts in assembler. Later we can see whether we can afford to introduce any abstraction or not e.g. to write parts of the code in Lua.
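For reference, the 48-cycle budget is just the clock rate divided by the packet rate. The short sketch below reproduces that figure and, as an added assumption that is not from the discussion above, also works out the budget for the worst case of minimum-size 64-byte Ethernet frames at 100 Gbps, counting 20 bytes of preamble and inter-frame gap per frame. The function names are hypothetical.

```python
def cycles_per_packet(clock_hz: float, packets_per_second: float) -> float:
    """Cycle budget per packet for one core running at clock_hz."""
    return clock_hz / packets_per_second


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second at line rate on Ethernet.

    Each frame occupies frame_bytes plus 20 bytes on the wire
    (8 bytes of preamble and 12 bytes of inter-frame gap).
    """
    return link_bps / ((frame_bytes + 20) * 8)


# The figure from the issue: 50 Mpps on a 2.4 GHz core.
budget = cycles_per_packet(2.4e9, 50e6)
print(budget)  # 48.0

# Assumed worst case: 64-byte frames at 100 Gbps.
worst_case = cycles_per_packet(2.4e9, line_rate_pps(100e9, 64))
print(round(worst_case, 2))
```

If each basic operation costs on the order of a few cycles, the 50 Mpps budget leaves room for only a handful of them per packet, and the minimum-size case leaves far less.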

References

Some relevant prior work that provides an encouraging backdrop:

  1. #603 is a simple assembler implementation of an app network. In this example, simple operations like transmitting a packet on a link or allocating from a freelist seem to cost around 5 cycles each.

  2. #557: the firehose packet capture application is able to receive and lightly inspect tens of millions of packets per second.

xrme commented 8 years ago

At the risk of looking like Barbie ("learning new tools is hard!"), I checked in the totally non-working code that I've been writing.

See xrme/snabbswitch@636ab74372014c4152b3e1bdc8c63c8ad1729b8b.

lukego commented 8 years ago

@xrme Looks like the right direction to me :).

I am starting to feel like a DynASM caveman myself. I have not been using the variable registers like Rq(i) in my code. It does seem like there is quite some flexibility in how to factor assembler code with DynASM.

I have also been admiring the DynASM AES crypto code in eugeneia/snabbswitch#8. First time I have seen higher-order functions used in an assembler program :-). Seems like it's possible to express things like loop unrolling in fairly short and neat ways with DynASM.

I like this kind of programming :-). The last assembler that I really used in anger myself was AsmOne and that was really a different era!