Open lukego opened 8 years ago
At the risk of looking like Barbie ("learning new tools is hard!"), I checked in the totally non-working code that I've been writing.
See xrme/snabbswitch@636ab74372014c4152b3e1bdc8c63c8ad1729b8b.
@xrme Looks like the right direction to me :).
I start to feel like a DynASM caveman myself.. I have not been using the variable registers like Rq(i) in my code. It does seem like there is quite some flexibility in how to factor assembler code with DynASM.
I have also been admiring the DynASM AES crypto code in eugeneia/snabbswitch#8. First time I have seen higher-order functions used in an assembler program :-). Seems like it's possible to express things like loop unrolling in fairly short and neat ways with DynASM.
I like this kind of programming :-). The last assembler that I really used in anger myself was AsmOne and that was really a different era!
What?
@xrme and I have been talking about the problem of dispatching a 100 Gbps traffic stream with one CPU core. This would be necessary to implement the "100G with software dispatching" multiprocessor design sketched in #685. What we need is a general N:M multiplexor-demultiplexor app that can split and merge packets between many links at very high speeds.
This is essentially a software implementation of NIC hardware features like RSS and VMDq, which inspect a packet before deciding which queue to assign it to, for example by looking up its destination MAC in a small table or by hashing its destination IP address.
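To make the two dispatch styles concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the proposed assembler implementation: the function names are hypothetical, the packet is assumed to be untagged Ethernet carrying IPv4, and CRC32 stands in for whatever hash a real dispatcher would use (hardware RSS typically uses a Toeplitz hash).

```python
import zlib

def dispatch_by_mac(packet: bytes, mac_table: dict, default_queue: int = 0) -> int:
    """VMDq-style dispatch: look up the destination MAC in a small table.

    The destination MAC is the first 6 bytes of the Ethernet header.
    """
    return mac_table.get(packet[0:6], default_queue)

def dispatch_by_ip_hash(packet: bytes, n_queues: int) -> int:
    """RSS-style dispatch: hash the IPv4 destination address to pick a queue.

    Assumes a 14-byte untagged Ethernet header followed by an IPv4 header,
    so the destination address sits at bytes 30..33. CRC32 is a stand-in hash.
    """
    dst_ip = packet[30:34]
    return zlib.crc32(dst_ip) % n_queues

# Hypothetical test packet: dst MAC, src MAC, EtherType 0x0800, then an
# IPv4 header that is zeroed except for the destination address 10.0.0.7.
packet = (bytes.fromhex("020000000001" "020000000002" "0800")
          + bytes(16) + bytes([10, 0, 0, 7]) + bytes(26))
mac_table = {bytes.fromhex("020000000001"): 3}

print(dispatch_by_mac(packet, mac_table))   # queue chosen by MAC table
print(dispatch_by_ip_hash(packet, 4))       # queue chosen by hashing dst IP
```

Both functions do a fixed amount of work per packet (one slice plus one table lookup or hash), which is the property that matters when the same logic has to fit in a tight cycle budget.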
Why?
There would be advantages to being able to do this in software instead of relying on hardware:
If the software implementation was really excellent then it may even be practical to drop features from our drivers and simplify our hardware integration.
How?
The challenge is to achieve practical performance. For example, to dispatch 50 million packets per second on a 2.4 GHz CPU the processing budget would be 48 cycles per packet. The work that would need to be done in those cycles is approximately:
Can we do this work within the cycle budget? The plan is to try and see. Since we are looking for the absolute limit of what the processor can achieve it seems to make sense to start by building up the parts in assembler. Later we can see whether we can afford to introduce any abstraction or not e.g. to write parts of the code in Lua.
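The cycle budget above is just clock rate divided by packet rate. A small sketch of the arithmetic, with a hypothetical helper name, also shows roughly how many simple operations would fit if each costs about 5 cycles, the ballpark figure quoted in the references:

```python
def cycles_per_packet(clock_hz: float, packets_per_second: float) -> float:
    """Cycles one core can spend on each packet at the given packet rate."""
    return clock_hz / packets_per_second

budget = cycles_per_packet(2.4e9, 50e6)
print(budget)  # 48.0

# Assumption: a "simple operation" (link transmit, freelist allocation)
# costs about 5 cycles, per the assembler app network example.
ops_that_fit = int(budget // 5)
print(ops_that_fit)  # 9
```

So the whole per-packet path has room for roughly nine such operations, before accounting for cache misses or branch mispredictions.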
References
Some relevant prior work that provides an encouraging backdrop:
#603 is a simple assembler implementation of an app network. In this example, simple operations like transmitting a packet on a link or allocating from a freelist seem to cost around 5 cycles each.
#557 The firehose packet capture application is able to receive and lightly inspect tens of millions of packets per second.