snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0
2.97k stars 300 forks source link

Thought experiment: Array Programming #1099

Open lukego opened 7 years ago

lukego commented 7 years ago

Here is a thought experiment: How would a Snabb program look if you used the Array Programming paradigm?

I wonder this because I have recently been programming in R - an array-based programming language - and I wonder about the merits of borrowing some ideas for Snabb. Here's a summary of the idea, a pseudo-code example, and a thought about potential advantages.

Basic idea

The basic characteristics of array programming are:

  1. Variables are bound to arrays of objects (e.g. array of 100 packets) rather than single objects (e.g. one packet.)
  2. Conditional execution is achieved by separating objects into groups and applying different operations to each group (instead of an if branch for each packet.)
  3. Objects can be associated via the indexes e.g. when dst[i] is the destination address of packets[i].

Example

Here is a pseudo-code learning ethernet bridge written in array programming style:

local table = learning_bridge_fwd_table.new()
local ports = { intel10g("01:00.0"), intel10g("02:00.0"), ...)

for i = 0, #ports-1 do                  -- i = input port number
   -- Receive packets from the port's NIC
   local packets = io.receive(ports[i]) -- array of packets from NIC
   -- Learn that the source MACs live on this port
   local src = map(packets, eth.source) -- array of ethernet source addresses
   table:learn(src, i)                  -- update forwarding table
   -- Split into multicast and unicast groups
   local multicast, unicast =           -- subsets of multicast and unicast
      split(p, dst, ethernet.is_multicast)
   -- Unicast: map destination address to output port.
   local dst = map(unicast, eth.dest)   -- array of ethernet dest
   local dport = table:lookup(dst)      -- for each port, indexes of packets
   -- Send packets to each output port
   for o = 0, #ports-1 do
      if o != i then                    -- (don't loop back to input port)
     io.transmit(ports[i], multicast + unicast[dport[o]])
      end
   end
end

Practicality

The downside I see here is that the programming style is less familiar. Could even be that it's fundamentally more awkward than the familiar packet-at-a-time style.

The upside I see is that it may be easier to optimize programs in which each line of code is operating on a batch of ~100 packets thanks to improved divide-and-conquer possibilities:

Conclusion

So! Highly speculative but potentially interesting? Could be an idea to trade some convenience for easier optimization? Experiments needed but whaddayarackon?

plajjan commented 7 years ago

Maybe it would look like P4 (p4.org) ?

While fd.io / VPP even has "vector processing" in the name I think it's different. I would say it's more "batching" or similar to the streaming ctable where we do the same thing (lookup / transformation / whatever) on a bunch of packets, then move on to the next step where we do some other thing on that bunch of packets. It reduces cache trashing and so speeds up things considerably but is distinct from how a true vector processor works, i.e. actually process two or more packets at the exact same time.

To continue on the P4 thread... it's built for chips like Barefoot's Tofino chip. They have themselves drawn strong similarities to GPUs and other fast signal processing chips applying instructions in parallel, i.e. vector processing. P4 as a language provides constructs for parallel processing. I think it assumes a parallel platform but it doesn't help the user much since it also exposes parallel semantics so the whole burden of putting it together is on the user.

I think the current (I guess "scalar") way of Snabb is very easy to learn. Ease of learning and short getting-going time are some of the strongest points for Snabb IMHO. Accomplished programmers might as well use VPP. If Snabb turns into some esoteric paralell-processing semantics it will significantly increase the threshold for many and thus the appeal. Therein lies a challenge of course: could some parallel processing be introduced without significantly making things more difficult? I don't know but I would suggest a thorough analysis and comparison with P4 to start with.

I guess there might also be an advantage to a platform (snabb that is) that supports both mode of operations. One can start out scalar and a more experienced brogrammer can convert (parts?) of the program into parallel processing making it a rather smooth transition...

lukego commented 7 years ago

Great perspective @plajjan!

Agree that it's important to focus on what is unique about Snabb e.g. being able to create practical network equipment without already being a grizzled system programmer.

Here is a new take on the idea. Just speculating that array-oriented operations are something that we may want to do more of in the future.

Snabb apps typically look something like:

while not link.empty(input) do
   local p = link.receive(input)
   local nexthop = route(p)
   ...
end

Here we have many packets available at the same time but for the sake of simplicity we process them one at a time. Now suppose we had a new library for routing packets and that it operates on many packets at once. One simple way to hook that in would be to put some array/vector operations at the start of the loop and then stay packet-at-a-time on the inside. For example:

function push ()
   local packets = link.receive_all(input)   -- pre-receive all packets
   local nexthops = routerlib.route(packets) -- pre-lookup all routes
   for i = 1, #packets do
      local p = packets[i]
      local route = nexthops[i]
      ...
   end
end

So likely incorporating batch-mode libraries is not really a hard problem when we are working in a general purpose language and the engine is feeding us batches of packets already.

Just continuing the pseudo-code braindump :-) here is a whimsical idea. Suppose you were building a simple router that wants to sort packets into categories ("arp for me", "ICMP for me", "IP to forward", etc.) Could be fun to build a little pfmatch style dispatcher that sorts the packets into separate arrays for processing:

local myip = '10.0.0.1'
local mymac = '10:20:30:40:50:60'
local d = pfdispatch(arp  = "ether dst ${mymac} and arp  and arp dst host ${myip}",
                     icmp = "ether dst ${mymac} and icmp and ip dst host ${myip}",
                     fwd  = "ether dst ${mymac} and ip   and not ip dst host ${myip}")

function push ()
   dispatcher:load(input)
   for p in d.arp do
      -- reply to arp
   end
   for p in d.icmp do
      -- reply to ICMP etc
   end
   for p in d.fwd do
      -- forward IP packets to next hop
   end
end

Just thinking that this programming style of first sorting packets into categories and then processing them has some of the nice characteristics: sorting can be a highly optimized operation (in the spirit of a TCAM); LuaJIT will generate separate code for each for loop (easier to think about); performance of each loop could be measured via the timeline log (e.g. get cycles/packet for ARP vs ICMP vs IP separately.)

Programming is fun... :-)

Thanks for the reminder to look at P4 too. I really must look closely. I also have the impression that the Vector in VPP is not really in the sense of an array/vector programming paradigm but closer in nature to Snabb struct link i.e. a set of packets that are all at the same stage in a high-level packet-processing network. I have been saying "array" rather than "vector" to avoid confusion there.