snabbco/snabb

Snabb: Simple and fast packet networking

Allow multiple Snabb processes to service a single NIC #757

Open · wingo opened this issue 8 years ago

wingo commented 8 years ago

Ideally all Snabb programs are super-zippy and run on just one core. It can be the case, though, that the workload is significant compared to the capacity of a core; for example, if you take an L3 cache miss once per packet, you're limited to some 10 or 12 Mpps. It would be good if we had a cheap horizontal scalability solution.

There are related issues that discuss this problem more generally:

This issue is for designing how to use the Intel 82599 hardware to enable horizontal scalability, so that we can get feedback from the experts and feedback on the Snabb API design. This issue would also be a good place for people to say if this is a terrible idea and we should instead use a software solution or something :)

wingo commented 8 years ago

[Figure: receive-packet dispatch diagram from the 82599 datasheet]

Above is a diagram of what happens to incoming packets, from the 82599 datasheet. I think the feature we need is called RSS, or receive-side scaling. The traffic is dispatched to different pools according to a hash function of its contents. The RSS hash function is specified by Microsoft, AIUI: https://msdn.microsoft.com/en-us/library/windows/hardware/ff570725(v=vs.85).aspx. The benefit of hashing compared to simple round-robin distribution is that multi-packet operations like reassembly remain possible in the sub-processes, since packets that are related to each other are all directed to the same sub-process.
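For reference, that hash is the Toeplitz hash: for each input bit that is set, the 32 key bits aligned with that bit position are XORed into the result. A minimal sketch in Lua, assuming LuaJIT's bit library; key and input here are plain byte arrays, not an existing Snabb structure:

    local bit = require("bit")

    -- Toeplitz hash over `input' (e.g. src IP .. dst IP .. src/dst port)
    -- using secret key `key'.  The key must be at least #input+4 bytes;
    -- the standard RSS key is 40 bytes.
    local function toeplitz_hash (key, input)
       local hash = 0
       -- 32-bit window over the key, initially key bits 0..31.
       local window = bit.bor(bit.lshift(key[1], 24), bit.lshift(key[2], 16),
                              bit.lshift(key[3], 8), key[4])
       for i = 1, #input do
          for b = 7, 0, -1 do
             if bit.band(bit.rshift(input[i], b), 1) ~= 0 then
                hash = bit.bxor(hash, window)
             end
             -- Slide the key window one bit to the left; the incoming
             -- bit is bit b (LSB-numbered) of key byte i+4.
             window = bit.bor(bit.lshift(window, 1),
                              bit.band(bit.rshift(key[i+4], b), 1))
          end
       end
       return bit.tobit(hash) -- normalize to a signed 32-bit value
    end

The NIC then uses the low-order bits of the hash to index a redirection table that selects the destination queue, which is what keeps all packets of a flow on the same sub-process.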

wingo commented 8 years ago

Probably the right thing to do is to initialize the NIC only if needed. If it's already in VMDq mode, we don't initialize it, and instead add a pool to the VT pool list, whatever that means :)

lukego commented 8 years ago

On a high level I reckon we should start to separate app networks into two parts: the application logic that runs on a given core (in the middle) and the I/O connectivity (at the edges). Ideally the application logic part could be equally happy when connected to many different I/O mechanisms (intel1g, intel10g, mellanox, tuntap, vhost-user, etc). See also #720.

On a low level each application needs to decide how it wants the traffic sharded between processes and how to accomplish that.

For example, do you want the outside world to see a single 100G lwAFTR (single L2/L3 address) or 10x10G ones (separate addresses)? If the former, you want traffic to be hashed across your instances, e.g. using the hardware RSS feature, with one process handling e.g. ARP/IPv6-ND and all other traffic hashed across the instances. If the latter, you want packets dispatched between instances based on a key, e.g. the destination MAC address, and then VMDq is better suited.

These hardware features all have limitations and differences between cards/vendors. For this reason I think it is important to have a comprehensive optimized traffic mux/demux capability in software (#691) and to view hardware mechanisms only as optional offloads to be used when application and NIC happen to match.

The intel1g driver makes NIC initialization optional. The idea is that if you run 10 processes then process 0 could be responsible for NIC init while processes 0-9 would all merely attach to TX/RX queues. I want to migrate all current/future drivers towards this model if indeed it works out in real life. (If it really is too constrained for only one process to touch the common registers on the NIC then we could introduce a mutex as a shm object or something like that... we have options.)
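A self-contained illustration of how a process could discover whether it should perform the one-time init, using an exclusively-created marker file; the path and scheme are invented here for illustration, not Snabb's actual mechanism:

    local ffi = require("ffi")
    ffi.cdef[[
    int open(const char *pathname, int flags, int mode);
    int close(int fd);
    ]]
    local O_RDWR, O_CREAT, O_EXCL = 0x02, 0x40, 0x80 -- Linux values

    -- Whichever process creates the marker first wins and performs the
    -- one-time NIC initialization; everyone else only attaches to queues.
    local function should_init (pciaddr)
       local path = "/var/run/snabb-nic-init-"..pciaddr
       local fd = ffi.C.open(path, O_RDWR + O_CREAT + O_EXCL, 0x1A4) -- 0644
       if fd < 0 then return false end -- marker exists: init already claimed
       ffi.C.close(fd)
       return true
    end

    if should_init("0000:01:00.0") then
       -- ... program the common NIC registers ...
    end
    -- ... attach to this process's TX/RX queue pair ...

A real implementation would also need the attachers to wait until init has actually completed (the shm-mutex idea above), but this shows the shape of the model.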

That any help?

lukego commented 8 years ago

cc @xrme

wingo commented 8 years ago

Thanks for the thoughts @lukego. The immediate motivation here is for 10G NICs, actually -- looking for a short-term win for the lwAFTR workload to give it some headroom on small packets. In that regard, from the outside, the number of Snabb processes servicing a NIC (or in our case, a pair of NICs) should be moot -- whether we handle traffic with one process or three, no problem. That generalizes to the 100G workload too, of course. However, what you say about wanting ARP/ND "in the firehose", so to speak, is true, and it complicates the plan to just run any number of worker processes to service a NIC(-pair).

wingo commented 8 years ago

Cc @dpino

lukego commented 8 years ago

Check out RSS. If you are lucky it can hash packets in a suitable way and ensure that ARP/ND are handled neatly too (e.g. always sent to the same process if that is important to you). It is a much simpler hardware feature than VMDq. If you hit a dead-end there then you need to talk with @xrme about being an early adopter of software mux/demux (#691).

I suspect that RSS will suit the lwAFTR well. I would be less confident with a stateful NAT, because then you have to be stricter about hashing, e.g. ensuring that the same customer is always hashed onto the same process in both directions. Your application is probably okay with a bit of asymmetry?

kbara commented 8 years ago

The only problem with asymmetry is fragmentation - and even then, raw asymmetry by direction is OK; what isn't OK is fragments that need to be reassembled together being routed to different places. (That can happen if the hash covers L4 ports, since non-first fragments lack the L4 header.) Having to do ARP/NDP multiple times is suboptimal but not really a big problem.

lukego commented 8 years ago

Looking ahead a bit...

I also quite like the idea of an N+1 process structure, i.e. N busy-looping traffic processes plus one coordinator process that does not process heavy traffic but could be "punted" certain packets, e.g. ARP/ND, and could act as a non-realtime-constrained front end for e.g. configuration changes.

wingo commented 8 years ago

Thanks for the tip. For some reason I thought we would need to enable virtualization, but perhaps that is just an artifact of how the 82599 driver is written and not an essential thing. With world enough and time I would spend a couple days refactoring that driver to be more like the intel1g code :)

lukego commented 8 years ago

Just an artifact. Make process 0 attach to RX queue 0 and TX queue 0, etc., and poke the RSS registers with a suitable hashing configuration (e.g. on the L3/L4 headers). Should work fine. I have some past experience with RSS on the 82599 from before Snabb Switch.
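To make "poke the RSS registers" concrete, here is a sketch of the configuration involved. The register offsets and bit values (MRQC at 0x0EC80, RETA at 0x0EB00, RSSRK at 0x0EB80) are my reading of the 82599 datasheet and should be double-checked; poke32 is a stand-in for whatever register-write primitive the driver provides:

    local bit = require("bit")

    -- Spread received traffic across RX queues 0..nqueues-1 with RSS.
    -- `poke32(offset, value)' is a placeholder for the register writer.
    local function enable_rss (poke32, nqueues)
       -- RSSRK: the 40-byte secret key as ten 32-bit words (example value).
       for i = 0, 9 do
          poke32(0x0EB80 + 4*i, 0x6d5a56da)
       end
       -- RETA: 128 redirection entries, four per 32-bit register, filled
       -- round-robin so the hash buckets map evenly onto the queues.
       for i = 0, 31 do
          local v = 0
          for j = 3, 0, -1 do
             v = bit.bor(bit.lshift(v, 8), (4*i + j) % nqueues)
          end
          poke32(0x0EB00 + 4*i, v)
       end
       -- MRQC: MRQE=0001b means "RSS only"; the upper bits select which
       -- header fields to hash (here IPv4/IPv6 and their TCP ports).
       poke32(0x0EC80, bit.bor(0x1, 0x00330000))
    end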

kbara commented 8 years ago

@lukego I totally agree about the N+1 process structure. Having a process handle things like NDP has been on my mind since at least October, and could shake out nicely from the direction this is going.

lukego commented 8 years ago

@kbara @xrme was also talking about some ideas where we could define the overall app network in the "+1" management process and it could fork/spawn all of the worker processes. That could be neat.

kbara commented 8 years ago

Hmm, yeah, that has potential.

lukego commented 8 years ago

@wingo I reckon you will be able to extend the intel10g module easily enough for this use case. The PF object is basically what you want for initializing the NIC and defining how packets are dispatched across queues; the VF object is basically what you want for coming along later and attaching an app to one TX queue and one RX queue. You just need a variant (a config option or a new object) that does this with RSS instead of VMDq, and that should be straightforward because RSS is the simpler mechanism.

This is assuming you don't find any blocking problems with RSS when looking at its capabilities in the data sheet.
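As a strawman for how that variant might look from the app network side; the option names (rss, rxqueue, txqueue) are invented for illustration and are not part of the current intel_app API:

    local config    = require("core.config")
    local intel_app = require("apps.intel.intel_app")

    -- In the initializing process: bring up the PF and have it dispatch
    -- with RSS across four queues (hypothetical `rss' option).
    local c0 = config.new()
    config.app(c0, "nic", intel_app.Intel82599,
               {pciaddr = "0000:01:00.0", rss = {nqueues = 4}})

    -- In worker process n (here n = 2): skip PF init and attach to
    -- queue pair n (hypothetical `rxqueue'/`txqueue' options).
    local n = 2
    local cn = config.new()
    config.app(cn, "nic", intel_app.Intel82599,
               {pciaddr = "0000:01:00.0", rxqueue = n, txqueue = n})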

lukego commented 8 years ago

(And you need to deal with the fact that the PF and VF objects will be instantiated in different processes, but that should be okay. If you have a fixed number of processes then you should be able to "set and forget" the RSS settings to dispatch across queues 0..n. This is trickier for VMDq in the NFV application, where each process will want to dynamically provision new queues for new MAC addresses.)

lukego commented 8 years ago

... oh, and then there is the question of "How do you do RSS on Virtio-net?"

There is multiqueue support for Virtio-net, but the details of dispatching seem underspecified to me compared with hardware mechanisms like RSS and VMDq. The basic strategy I have in mind is to be flexible in the NFV application and try to support extensions for whatever kind of hashing the VMs really need. Still: as an application developer I would want a software mux/demux as insurance, to make sure the application can be deployed on diverse I/O infrastructure.

dpino commented 8 years ago

I have had a go at implementing RSS support. The comments in this thread have been very helpful.

Thinking about the current context of the lwAFTR, we need to set vmdq = true (it seems VLAN tag stripping by hardware is only available in VMDq mode). Probably we will need to modify the if conf.vmdq branch in Intel82599:new(): https://github.com/SnabbCo/snabbswitch/blob/master/src/apps/intel/intel_app.lua#L38

   if conf.vmdq then
      -- Lazily create the PF (physical function) for this PCI address.
      if devices[conf.pciaddr] == nil then
         devices[conf.pciaddr] = {pf=intel10g.new_pf(conf):open(), vflist={}}
      end
      local dev = devices[conf.pciaddr]
      -- Claim the first free pool number and create a VF for it.
      local poolnum = firsthole(dev.vflist)-1
      local vf = dev.pf:new_vf(poolnum)
      dev.vflist[poolnum+1] = vf
      return setmetatable({dev=vf:open(conf)}, Intel82599)
   else

In https://github.com/SnabbCo/snabbswitch/issues/522#issuecomment-114013203, @javierguerragiraldez suggested adapting the Intel82599 driver to support RSS in the following way:

1) When a device is first initialized, it initializes its virtual devices too:

nic0 args={vmdq='32*4', pools={
    [1]={mac='52:54:00:01:01:01', vlan=18, rate_limit=1e9},
    [2]={mac='52:54:00:02:02:02', vlan=18, rss='ipv6'},
    [3]={mac='52:54:00:10:10:10', vlan=1, mirror_pools={1,2}},
}}

2) Later, when the same process, or a different one, initializes the same driver, it simply requests a TX/RX queue from the VF pool:

nic0rx1 args={rxpool=1},
nic0tx1 args={txpool=1},

To me the main challenge is when a different process requests a queue from a VF. It seems that some sort of interprocess communication will be needed (a master process initializes all the VFs, and a slave process requests a TX/RX queue from a VF or, much simpler, a whole VF). Is it possible to use the shared memory mechanisms to share an object as complex as the VF object? If not, what would be the right way to share the TX/RX queues among different processes?
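One constraint worth keeping in mind: shared memory can only carry flat FFI data, not Lua objects with closures and metatables, so what would cross the process boundary is a plain descriptor that lets the worker reconstruct its own queue accessors, not the VF object itself. A generic POSIX sketch (this is not Snabb's shm API; the struct and path are invented):

    local ffi = require("ffi")
    ffi.cdef[[
    int open(const char *pathname, int flags, int mode);
    int ftruncate(int fd, long length);
    void *mmap(void *addr, size_t length, int prot, int flags,
               int fd, long offset);
    typedef struct {
       uint8_t initialized; // has the PF been set up?
       uint8_t nqueues;     // how many queue pairs are provisioned
    } nic_shared_state;
    ]]
    local O_RDWR, O_CREAT = 0x02, 0x40          -- Linux values
    local PROT_READ, PROT_WRITE, MAP_SHARED = 1, 2, 1

    -- Map a flat struct backed by a file under /dev/shm.
    local function map_shared_state (name)
       local fd = ffi.C.open("/dev/shm/"..name, O_RDWR + O_CREAT, 0x1A4)
       assert(fd >= 0, "open failed")
       local size = ffi.sizeof("nic_shared_state")
       assert(ffi.C.ftruncate(fd, size) == 0, "ftruncate failed")
       local ptr = ffi.C.mmap(nil, size, PROT_READ + PROT_WRITE,
                              MAP_SHARED, fd, 0)
       assert(ptr ~= ffi.cast("void *", -1), "mmap failed")
       return ffi.cast("nic_shared_state *", ptr)
    end

    local state = map_shared_state("nic-0000:01:00.0")
    if state.initialized == 0 then
       -- Master: initialize the PF and VFs, then publish the fact.
       state.nqueues, state.initialized = 4, 1
    end

(Real code would also need the check-then-set to be atomic; this only shows that plain data, not the VF object, is what can be shared.)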

javierguerragiraldez commented 8 years ago

To me the main challenge is when a different process requests a queue from a VF. It seems that some sort of interprocess communication will be needed (a master process initializes all the VFs, and a slave process requests a TX/RX queue from a VF or, much simpler, a whole VF).

I'm not sure that IPC is absolutely needed... the 'second' process could just open the device and grab a couple of pools, blindly hoping that the main NIC initialization is already done.

dpino commented 8 years ago

@javierguerragiraldez Thanks for the input, Javier. I thought about that, but the physical device is accessed via memory-mapped PCI registers and the driver takes an exclusive lock on the device; if a second process tries to map the same physical device, there's an error. I could change the lock to a shared lock, which would allow two different processes to map the same device. Do you think that would work?
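For reference, the two lock modes in question, via flock(2) through LuaJIT's FFI; the fd would be whatever descriptor the driver holds on the PCI resource file:

    local ffi = require("ffi")
    ffi.cdef[[int flock(int fd, int operation);]]
    local LOCK_SH, LOCK_EX, LOCK_NB = 1, 2, 4   -- from <sys/file.h>

    -- shared = false reproduces the current one-process-per-device
    -- behavior; shared = true lets cooperating processes map the same
    -- device while still excluding any exclusive-lock holder.
    local function lock_device (fd, shared)
       local op = (shared and LOCK_SH or LOCK_EX) + LOCK_NB
       return ffi.C.flock(fd, op) == 0
    end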

javierguerragiraldez commented 8 years ago

this is where my memory gets fuzzy; but: