lukego commented 8 years ago

Here is a radical idea that came up in a conversation with @wingo:

To adopt a "StraightNIC" design where device drivers are absolutely minimal: one transmit queue, one receive queue, end of story. The rest is done in software and works exactly the same for all I/O sources (1G/10G/40G/100G, Intel/Mellanox/Virtio/Tuntap, ...).

This would be following the example of the "straightline" redesign where we made our packet struct absolutely minimal by removing scatter-gather buffer chains, checksum metadata, separate memory pools, etc. This invalidated a lot of special-case optimizations but the overall result has been simpler code and better performance.

The gambit is that focusing on the simplest and most general case leads to the best overall outcome i.e. expect that special-case optimizations like VMDq, RSS, FlowDirector, etc, are a net loss in the big picture because they lead to user-visible inconsistency and soak up development time e.g. when trying to add new hardware support with sufficient functionality.

Thoughts???

See also I/O 2.0 (#687). This would also depend very much on highly optimized software replacements for the relevant NIC functions (#691).

kbara commented 8 years ago

If this is feasible, it would be excellent in many ways. The only downside is that it'll absolutely fully load a whole CPU core, and the budget per packet will be measured in the low tens of nanoseconds. I wonder how often we'll run into thermal throttling of other cores on the same physical CPU... Note that we'd then also need to deal with or strip vlans in software, and there has been previous discussion about how that much copying of memory is too costly. Architecturally, I love it. Pragmatically, I'll have to see it to believe it.
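
As a rough check on the per-packet budget, here is a small C sketch of the arithmetic. It assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame wire overhead (preamble, start delimiter, inter-frame gap); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire that is not part of the frame itself:
 * 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap. */
enum { WIRE_OVERHEAD_BYTES = 20 };

/* Maximum packets per second on a link of `link_bps` bits per second
 * carrying back-to-back frames of `frame_bytes` (FCS included). */
static double max_pps(double link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes >= 64);  /* minimum Ethernet frame size */
    return link_bps / (8.0 * (frame_bytes + WIRE_OVERHEAD_BYTES));
}

/* Time available per packet, in nanoseconds, if one core has to keep
 * up with line rate on its own. */
static double budget_ns(double link_bps, uint32_t frame_bytes)
{
    return 1e9 / max_pps(link_bps, frame_bytes);
}
```

With these assumptions `budget_ns(10e9, 64)` comes out at about 67 ns and `budget_ns(100e9, 64)` at about 6.7 ns, so the worst case at 100G is even tighter than "low tens of nanoseconds". Larger average packet sizes relax the budget proportionally.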

wingo commented 8 years ago

AFAIU with this design, you could connect directly to the NIC if you wanted to handle all of its traffic, so no difference from the current status.

Just brainstorming the sorts of things we'd need to write if we do this :)

A software version of VMDq, to start with, to replace the current functionality of the Intel82599 drivers. It should dispatch to a number of queues via VLAN or MAC address.
VLAN stripping / tagging, and stamping of outgoing MAC addresses. Should this be done by the workload program / app network I guess? An open question.
There is some strange mirror facility in the Intel cards that might also need to be replicated; not sure though.
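
To make the first item concrete, a software stand-in for VMDq-style dispatch might look roughly like the C sketch below. This is not Snabb code: `dispatch_rule`, `frame_vlan` and `dispatch` are hypothetical names, the rule table is a plain linear scan, and only a single 802.1Q tag is understood.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERTYPE_VLAN 0x8100
#define NO_VLAN 0xFFFF          /* rule ignores the tag / frame is untagged */
#define NO_QUEUE (-1)

/* One dispatch rule: frames for this MAC (and optionally VLAN) go to `queue`. */
struct dispatch_rule {
    uint8_t  mac[6];
    uint16_t vlan;              /* VLAN ID, or NO_VLAN to ignore the tag */
    int      queue;
};

/* Return the VLAN ID of an Ethernet frame, or NO_VLAN if it is untagged. */
static uint16_t frame_vlan(const uint8_t *frame, size_t len)
{
    if (len < 18) return NO_VLAN;
    uint16_t ethertype = (uint16_t)(frame[12] << 8 | frame[13]);
    if (ethertype != ETHERTYPE_VLAN) return NO_VLAN;
    return (uint16_t)((frame[14] << 8 | frame[15]) & 0x0FFF);
}

/* Linear scan over the rules; returns the queue index or NO_QUEUE. */
static int dispatch(const struct dispatch_rule *rules, size_t nrules,
                    const uint8_t *frame, size_t len)
{
    if (len < 14) return NO_QUEUE;              /* too short for Ethernet */
    uint16_t vlan = frame_vlan(frame, len);
    for (size_t i = 0; i < nrules; i++) {
        /* The destination MAC is the first six bytes of the frame. */
        if (memcmp(rules[i].mac, frame, 6) != 0) continue;
        if (rules[i].vlan != NO_VLAN && rules[i].vlan != vlan) continue;
        return rules[i].queue;
    }
    return NO_QUEUE;
}
```

A real dispatcher would probably want a hash table rather than a scan once the number of rules grows, and would need a policy for broadcast and unmatched frames, which this sketch simply reports as `NO_QUEUE`.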

Would we use virtio-style queues between processes or something bespoke? I don't know the tradeoffs. I would lean towards bespoke, of course :)

Some things that this would enable us to have:

Same queueing / dispatch functionality with all NICs.
Easier custom queueing / dispatch (e.g. you could transparently do dispatch on L3 header fields in flexible ways).
The machinery involved would be re-usable to break program networks into more processes, which might help scaling.
On the other hand, this facility might allow for more optimal resource usage. Let's say you have an app which can process 25 Gbps. If you run it on a 10g port you're wasting cores; if you run it on a 100Gbps port you can't keep up. By allowing us to divide incoming traffic into a queue count flexibly, in a way that suits the performance of the workload, we might be able to use cores in a better way. Of course this goes the other way too, for when an app can only handle 5 Gbps and you need more horizontal scalability.
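
The sizing argument in the last item is just a ceiling division. A tiny C sketch, with a hypothetical function name and the optimistic assumption that the dispatcher spreads load evenly across queues:

```c
#include <assert.h>
#include <stdint.h>

/* Number of worker queues needed so that no worker sees more traffic
 * than it can handle. Assumes an even split, which real traffic with a
 * few large flows will not give you. */
static uint32_t queues_needed(uint32_t port_gbps, uint32_t app_gbps)
{
    assert(app_gbps > 0);
    return (port_gbps + app_gbps - 1) / app_gbps;   /* ceiling division */
}
```

For the numbers above: a 25 Gbps app on a 100 Gbps port needs 4 queues, the same app on a 10 Gbps port needs 1, and a 5 Gbps app on a 10 Gbps port needs 2.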

Needless to say I like it and would love to have this as part of the Snabb story, contingent on it actually working :-))

wingo commented 8 years ago

Are you thinking of copying packet data from the dispatcher to sub-process ring buffers, or sharing packets in some global pool and passing around descriptors instead?

struct ipc_link {
  struct packet packets[LINK_RING_SIZE];
  uint64_t read, write;
};

^ that would be a delightful link, if it performed ok :)

wingo commented 8 years ago

Though, I guess since you can't write a packet atomically, you run the risk of corruption. I suppose you could detect that via how far the write pointer advanced while you were reading... a big challenge in any case :)
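
One way to sidestep the corruption risk, rather than detect it, is to treat the link as a single-producer/single-consumer ring in which the producer never overwrites a slot the consumer has not released, and drops when the ring is full. The C11 sketch below fleshes out the struct above under that assumption. The `struct packet` layout, `PACKET_PAYLOAD_SIZE`, and the function names are invented for the example and are not Snabb's actual definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINK_RING_SIZE 256           /* must be a power of two */
#define PACKET_PAYLOAD_SIZE 2048     /* hypothetical, for illustration only */

struct packet {
    uint16_t length;
    uint8_t  data[PACKET_PAYLOAD_SIZE];
};

struct ipc_link {
    struct packet packets[LINK_RING_SIZE];
    _Atomic uint64_t read;           /* advanced only by the consumer */
    _Atomic uint64_t write;          /* advanced only by the producer */
};

/* Producer side: copy `p` into the ring. Returns false (drop) when full. */
static bool ipc_link_transmit(struct ipc_link *l, const struct packet *p)
{
    assert(p->length <= PACKET_PAYLOAD_SIZE);
    uint64_t w = atomic_load_explicit(&l->write, memory_order_relaxed);
    uint64_t r = atomic_load_explicit(&l->read, memory_order_acquire);
    if (w - r == LINK_RING_SIZE) return false;
    struct packet *slot = &l->packets[w & (LINK_RING_SIZE - 1)];
    slot->length = p->length;
    memcpy(slot->data, p->data, p->length);
    /* Publish the slot only after the copy is complete. */
    atomic_store_explicit(&l->write, w + 1, memory_order_release);
    return true;
}

/* Consumer side: copy the oldest packet out. Returns false when empty. */
static bool ipc_link_receive(struct ipc_link *l, struct packet *out)
{
    uint64_t r = atomic_load_explicit(&l->read, memory_order_relaxed);
    uint64_t w = atomic_load_explicit(&l->write, memory_order_acquire);
    if (r == w) return false;
    const struct packet *slot = &l->packets[r & (LINK_RING_SIZE - 1)];
    out->length = slot->length;
    memcpy(out->data, slot->data, slot->length);
    /* Release the slot only after the copy is complete. */
    atomic_store_explicit(&l->read, r + 1, memory_order_release);
    return true;
}
```

Between processes the struct would have to live in shared memory, and this only works with exactly one writer and one reader per link. Whether two copies per packet perform acceptably is exactly the open question raised above; the sketch says nothing about that.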

plajjan commented 8 years ago

VLAN stripping / tagging, and stamping of outgoing MAC addresses. Should this be done by the workload program / app network I guess? An open question.

Isn't it already? Or does the NIC stamp src MAC when VMDq is enabled? It cannot reasonably put dst MAC in there (how would it know?). I say it stays in app network.

I'm torn over the subject in general. Sure, simplification sounds like a good thing but on the other hand I want 100G and I don't think it's reasonable to expect a single core to be able to take over the RSS job from the NIC, so I'd like RSS. At the same time I don't know enough about the drivers, are they really that complex?

lukego commented 8 years ago

I have also been feeling conflicted in exactly the way @plajjan describes. On the one hand the pure software implementation is compellingly simple but on the other hand it feels too optimistic to assume this will always be sufficient.

On reflection now I feel that neither extreme is right - all hardware or all software - and I have a proposal for a compromise. Let me review the extremes and then spell out the proposal for a compromise.

Hardware-oriented approach

The fundamental trouble with the all-hardware approach is dealing with the variability between NICs. How do you write drivers with a consistent interface when the hardware they are controlling has different behavior?

For example, here are some of the questions that have different answers from one NIC to the next:

How can we dispatch traffic - L2, L3, L4 header?
Which L3/L4 protocols are recognized for dispatching?
Can we do exact matches and hashing at the same time - if so how does this impact the number of queues available?
Can we do "hairpin turns" on packets and does this bypass any other dispatching features?
Which standard SNMP/YANG counters can be sourced "for free" from hardware registers?

One way to deal with this is to say that drivers will each have a unique interface that describes the card they support. This punts the problem up to the application developer e.g. to pick one card to target and deal with its peculiarities. This is basically the situation today e.g. the NFV application has targeted the Intel 82599 specifically and when we added Solarflare support we worked with them to extend their firmware for feature parity.

However, in the big picture this is not so satisfactory. One problem is that it does not help applications to support many NICs while still benefiting from the hardware that is available. Another problem is that hardware sometimes lets you down later in the lifecycle: for example, you get a new requirement to support a transport protocol that the NIC does not understand (e.g. GTP, L2TP, MPLS, etc) and that could screw up your application by hashing all your input into the same bucket, forcing you to redesign the application or move to new hardware.

Software-oriented approach

The all-software approach has the virtue of uniformity and predictability. This is huge, as described above. The primary problem is uncertainty about what this costs: How much CPU must you reserve for dispatching? Does it create a bottleneck that limits your peak packet rate? Then once you have these answers the secondary problem is: Is it worth it, and for which applications?

I would love to reach the point where we can say that software dispatching has modest overhead and is the right choice for all but the most extreme applications. However, in practice it is a lot of work to understand the possibilities, and there is no guarantee that the results will represent the right trade-off for application developers e.g. that the NFV application would really be better off switching from VMDq to a software switch+tag app.

Proposed compromise

I think we need to make a pragmatic compromise that takes advantage of hardware features when available, software-emulates hardware features that are not available, and provides well-defined interfaces to application developers.

I have a proposal for how we could do this and it is one of those "turtles all the way down" ideas.

The idea is that we could define abstract apps that become concrete when you instantiate them. The new() method of an abstract app would return an app network fragment (i.e. a config object) instead of a single app instance. This fragment could be generated dynamically based on the config supplied to the abstract app: for example if the config specifies a PCI address then the app could check the device ID and return the driver that is appropriate. If the driver does not support all of the necessary features then the app network fragment could include some supplementary software apps.

That is the whole idea. Let me give an example to be more concrete.

Suppose we developed a switched_nic app:

[Figure: switched_nic, abstract view]

This app can have multiple ports attached to it, can dispatch packets based on DMAC and/or VLAN, can be configured to automatically insert/remove 802.1Q VLAN tags, and can optionally provide all of the interface counters required by SNMP MIB-II objects.

Suppose that you instantiate the switched_nic app and provide the PCI address of an Intel 82599 NIC. In this case the new() method would check the PCI device ID and recognize that all of the processing can be offloaded onto hardware. It would return a config object that specifies a suitable intel_app to splice into the app network as the implementation of the switched_nic:

[Figure: switched_nic implemented with Intel VMDq]

Now suppose that you instantiate the switched_nic with a tap device. In this case the new() method would know that emulation is needed and return an app network that includes both a tap device for I/O and a vlan_switch for emulating VMDq:

[Figure: switched_nic implemented with a tap device and software]

If the config required SNMP counters to be provided then we may also need to include a software app that inspects the packets and updates the appropriate tallies (unicast packets, broadcast packets, etc).

This would mean that applications depending on VMDq-like functionality could be deployed on every I/O medium that the switched_nic app supports.
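
The selection logic described here can be sketched as "requested features minus what the hardware provides equals what software must add". The C sketch below is only an illustration of that idea, not a proposed API: the feature flags, `hw_features`, the `APP_COUNTERS` helper app and the assumption that the 82599 covers every feature in hardware are all hypothetical, while `intel_app`, the tap device and `vlan_switch` are the apps named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Features a switched_nic user may ask for (illustrative names). */
enum {
    FEAT_MAC_DISPATCH = 1u << 0,
    FEAT_VLAN_TAGGING = 1u << 1,
    FEAT_COUNTERS     = 1u << 2,
};

enum device_kind { DEV_INTEL_82599, DEV_TAP };
enum app_kind { APP_INTEL, APP_TAP, APP_VLAN_SWITCH, APP_COUNTERS };

/* What each driver is assumed to provide without software help. */
static uint32_t hw_features(enum device_kind dev)
{
    switch (dev) {
    case DEV_INTEL_82599:
        return FEAT_MAC_DISPATCH | FEAT_VLAN_TAGGING | FEAT_COUNTERS;
    case DEV_TAP:
    default:
        return 0;
    }
}

#define MAX_APPS 4

/* The "app network fragment": the list of apps to splice in. */
struct fragment {
    enum app_kind apps[MAX_APPS];
    size_t        napps;
};

static void add_app(struct fragment *f, enum app_kind app)
{
    assert(f->napps < MAX_APPS);
    f->apps[f->napps++] = app;
}

/* Pick the driver, then add software apps for whatever it lacks. */
static struct fragment switched_nic_new(enum device_kind dev, uint32_t wanted)
{
    struct fragment f = { .napps = 0 };
    add_app(&f, dev == DEV_INTEL_82599 ? APP_INTEL : APP_TAP);
    uint32_t missing = wanted & ~hw_features(dev);
    if (missing & (FEAT_MAC_DISPATCH | FEAT_VLAN_TAGGING))
        add_app(&f, APP_VLAN_SWITCH);
    if (missing & FEAT_COUNTERS)
        add_app(&f, APP_COUNTERS);
    return f;
}
```

Asking for dispatch and tagging on the 82599 yields a single-app fragment, while the same request on a tap device yields the tap driver plus `vlan_switch`. The links between the apps in the fragment are left out of the sketch.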

The overall effect would seem to be to make life easy for driver developers (just implement what the card supports), and to make life easy for application developers (just pick the abstraction that suits your application), but difficult for people writing the abstract apps (the switched_nic app would not be trivial and would require extensive tests and documentation to give users confidence and understanding).

This would also allow the hardware vs software battle to continue in the background. Vendors can add powerful new features to NICs, hackers can write optimized software alternatives, and over time we will see if these keep each other in balance or whether the wind blows one way or the other.

plajjan commented 8 years ago

@lukego I love it! I was actually thinking about something similar although I think you expressed my abstract spaghetti thoughts into something much more concrete. For my SnabbDDoS program I want to support:

number of NICs:
- dual NIC, i.e. use one port for ingress ("dirty") and another for egress ("clean") traffic
- single NIC, -on-a-stick. Since traffic is unidirectional we don't need full-duplex and we can halve required number of NICs with this approach
type of NIC
- 82599 for production
- tap for development, debugging or whatever

Single NIC (82599 or tap) requires VLAN tagging (unless we want really weirdo config on the router with PBR or similar). Now, to complicate things I have an issue on the 82599EB-old-a-f card that vmdq doesn't seem to work properly so for most 82599 cards I want vmdq but for this particular version I want to use software for VLAN demux/muxing.

Thus the final config matrix is rather complicated. The current state of the code in SnabbDDoS program for initing the app network is a mess and it makes me sad panda just looking at it. I was thinking if I could abstract the 82599-vmdq-or-software-vlan into an app to simplify things and I think this is just what you have described here but in a much more generic and elegant way.

Complexity of switched_nic could, as you point out, shoot through the roof if we are not careful. It's probably a good idea to start out with a minimal feature set and ignore more advanced features to begin with. I'm thinking just matching the vmdq feature set, like vlan (de)tag and src mac rewrite, so it's noop on 82599 but it will do things for a tap interface.

petebristow commented 8 years ago

What you describe sounds like relatively heavyweight composite apps, and it uses the language of an app. I think this problem would be better thought about as a configuration problem, perhaps answered by supporting configuration macros that give a bundle of functionality but are transparent to the user, and thought about in terms of a declarative configuration file. It's not immediately obvious that making these composite apps flexible enough to support all use cases won't become another point of configuration, and then we will be back at square one with a configuration problem, but with composite apps as well. switched_nic sounds like it's a lot of work, but none of the following do:

An app that dispatches to a set of links based on statically defined macs
An app that adds / removes Vlan tags
An app that ECMPs traffic over a set of output links

@plajjan requires MPLS labels to be attached: cool, have an mpls app. @petebristow requires Vlans. Someoneelse requires QinQ. A.Nother wants more labels for segment routing. All of these are configuration problems rather than app problems. These are all simple configuration matters. Having pure configuration macros means they are seen as lightweight reusable chunks. I can now have a chunk that covers my physical NIC + fastPathApp + tap interface for the slow path I want Linux to handle. It would mean we need to revisit the Lua-as-configuration decision, but I think this needs to be done anyway as part of the Yang debate. We also need to think about what configuring a multiprocess snabb network should look like.

I'm all in favor of the StraightNIC idea with the extension of RSS support. It's my feeling that most of the acute performance concerns that lead to fancy hardware and complex drivers could be alleviated by having more cores. Some app designs might not parallelize to many cores nicely, but it's my slightly biased opinion that they were going to hit problems anyway. If the role is 'popular' enough to need more than one core it probably also needs more than one 10gig link, more than one server and more than one site. So you may as well figure out how to scale horizontally from the start.

Not all NICs support all RSS hash modes, but then there aren't many guarantees on what upstream ECMP hashing will look like either. Huge flows that would overwhelm a given path are a pain but they always have been and always will be. 100gig was great until we needed n*100gig :(

If you get 1 million pps per core and have 6 Mpps of load you run 6 processes, each with an rx and tx queue. This stops working when you run out of queues, but the 82599 has 16 queues and the Solarflare appears to have 64.

I've added RSS support to the Intel1g driver and I'm just finishing the multi-process support for it. It's run as entirely separate snabb instances, each with its own distinct config, which is nice to demo but sounds pretty horrific to operationalize. I'll try and get a [wip] PR in, in the next few days.
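
For comparison with hardware RSS, the software equivalent of "hash the flow, pick a queue" is small. The C sketch below uses FNV-1a purely because it is short and well known; NIC RSS typically uses a Toeplitz hash instead, so this does not reproduce what any card computes. `flow_key`, `flow_hash` and `queue_for_flow` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The fields a flow is identified by (IPv4 5-tuple). */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* 32-bit FNV-1a, continuing from hash state `h`. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Hash field by field so that struct padding never enters the hash. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    h = fnv1a(h, &k->src_ip, sizeof k->src_ip);
    h = fnv1a(h, &k->dst_ip, sizeof k->dst_ip);
    h = fnv1a(h, &k->src_port, sizeof k->src_port);
    h = fnv1a(h, &k->dst_port, sizeof k->dst_port);
    h = fnv1a(h, &k->protocol, sizeof k->protocol);
    return h;
}

/* All packets of one flow land on the same queue, preserving order. */
static unsigned queue_for_flow(const struct flow_key *k, unsigned nqueues)
{
    assert(nqueues > 0);
    return flow_hash(k) % nqueues;
}
```

With the 1 Mpps-per-core example above, `nqueues` would be 6. As noted, a single huge flow still lands on one queue whichever hash is used.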

petebristow commented 8 years ago

@plajjan Should SnabbDDoS have to deal with site-specific configuration like that? Wouldn't having a standard app + a site-specific config file be much better? I feel like that config should be able to define the app network in a declarative, non-Lua way. Your NMS/CMS/OSS then worries about the class of host and what the config should look like.

plajjan commented 8 years ago

I think this problem would be better thought about as a configuration problem, perhaps answered by supporting configuration macros that give a bundle of functionality but are transparent to the user, and thought about in terms of a declarative configuration file. It's not immediately obvious that making these composite apps flexible enough to support all use cases won't become another point of configuration and we will be back at square 1 with a configuration problem, but have composite apps as well.

I was thinking about switched_nic as a form of config helper, not that it actually does all the work. For example, I instantiate a switched_nic and say I want VLAN X on NIC Y. If Y=02:00.0 is an 82599 it will be configured with VMDq for VLAN tagging. If Y happens to be a tap interface then switched_nic will instantiate a Tap driver and a VlanMux and connect those together. Same end result, something that takes input packets and strips VLAN tags!
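
For reference, the per-packet work the software path has to do for VLAN (de)tagging is small. A C sketch of pushing and popping a single 802.1Q tag in place follows; the function names are hypothetical, and this is not the code from the VlanMux or from #863.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERTYPE_VLAN 0x8100
#define VLAN_TAG_LEN   4
#define ETH_ADDRS_LEN  12   /* destination MAC + source MAC */

/* Insert an 802.1Q tag after the MAC addresses. `buf` must have room
 * for `len + 4` bytes. Returns the new frame length. */
static size_t vlan_push(uint8_t *buf, size_t len, uint16_t vid)
{
    assert(len >= 14 && vid < 4096);
    memmove(buf + ETH_ADDRS_LEN + VLAN_TAG_LEN, buf + ETH_ADDRS_LEN,
            len - ETH_ADDRS_LEN);
    buf[12] = ETHERTYPE_VLAN >> 8;
    buf[13] = ETHERTYPE_VLAN & 0xFF;
    buf[14] = (uint8_t)(vid >> 8);      /* PCP and DEI bits left at zero */
    buf[15] = (uint8_t)(vid & 0xFF);
    return len + VLAN_TAG_LEN;
}

/* Remove the outermost 802.1Q tag if present and store its VLAN ID in
 * `*vid`. Returns the new length (unchanged for untagged frames). */
static size_t vlan_pop(uint8_t *buf, size_t len, uint16_t *vid)
{
    if (len < 18 || buf[12] != (ETHERTYPE_VLAN >> 8)
                 || buf[13] != (ETHERTYPE_VLAN & 0xFF))
        return len;
    *vid = (uint16_t)((buf[14] << 8 | buf[15]) & 0x0FFF);
    memmove(buf + ETH_ADDRS_LEN, buf + ETH_ADDRS_LEN + VLAN_TAG_LEN,
            len - ETH_ADDRS_LEN - VLAN_TAG_LEN);
    return len - VLAN_TAG_LEN;
}
```

This version shifts the payload, which is the expensive direction. If the packet buffer had headroom in front of the frame, moving the 12 address bytes instead would touch far less memory, which bears on the copying-cost concern raised earlier in the thread.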

switched_nic sounds like it's a lot of work but none of the following do

An app that dispatches to a set of links based on statically defined macs An app that adds / removes Vlan tags

No, not really. It doesn't sound like a lot of work but that's probably because we have different ideas of what it is or should do. To me it's a convenience API so I don't have to think about what the NIC supports or not.

Just adding the apps you list won't do anything. It's kind of the situation we have today. I already have an app that adds/removes VLANs (#863) but how do I add that to my app network and when do I need to? Should I never use VMDq so my app network is consistent (StraightNIC!), or should I allow hardware offload for some stuff? That complexity is what I would like to abstract away.

An app that ECMPs traffic over a set of output links @plajjan requires MPLS labels to be attached cool have an mpls app @petebristow requires Vlans Someoneelse requires QinQ A.Nother wants a more labels for segment routing. All of these are configuration problems rather than app problems.

Well, this is a problem for my program not my app. The DDoS app doesn't care about the NIC, it just wants packets. But the SnabbDDoS program does need to care, how else would this whole thing work?

Also, funny that you would mention MPLS since I'm working with a network where half the ho-ha is about us not using MPLS ;)

These are all simple configuration matters. Having pure configuration macros means they are seen as lightweight reusable chunks. I can now have a chunk that covers my physical nic + fastPathApp + tap interface for slowpath I want linux to handle. It would mean we need to revisit the lua as configuration decision but I think this needs to be done any way as part of the Yang debate. We also need to think about what configuring a multiprocess snabb network should look like.

Okay. So it sounds like you have bigger things in mind here. I was thinking of switched_nic as something I'd write in a day. Rewriting Snabb's config to be something other than Lua sounds like a slightly bigger topic.

The way things are right now, Snabb offers nothing in this area. It is up to each developer to build a program that takes the configuration options they wish to support and in effect that will limit the deployment options you have as a user. I personally have no need for MPLS so I will not add that to SnabbDDoS.

I definitely think there are improvements to be made here. I am not really interested in writing all this "glue" stuff for deployment. Snabb doesn't even have MPLS support today but let's say I upstream SnabbDDoS and someone adds MPLS tomorrow, wouldn't it be neat if, like you express, a user could just add some config options and use SnabbDDoS in an MPLS deployment? I think so!

Snabb is really just a basic framework at this point. If we look at the only other thing out there that is vaguely similar - VPP - there is still a stark contrast in what they provide. Both Snabb and VPP have the ideas of nodes in a graph that each do one little thing well. Both use batching and various other clever techniques to achieve high performance. But VPP offers a lot more out of the box. You can spin one up, configure a couple of interfaces and some static routes, through a CLI, and have it forward packets for you. It's like Snabb but there is a default app network which provides you with lots of stuff.

I wrote a DDoS app, a node in the graph/app network.

In VPP I would insert that node in a pre-existing graph so I could benefit from all the routing / tunneling / whatever that is already in VPP.
In Snabb I am expected to not only write the app/node but also to build the entire graph / app network around it.

I guess Snabb will move in the direction of VPP. Once there are more standard apps available it makes sense to ship most of these in a default app network. Developers are always allowed to start over with a clean slate if they wish but they shouldn't need to.

@petebristow I don't want my NMS to have to worry about whether NIC + VLAN is actually two apps or if one app will do the job. Internal app network is largely irrelevant to external parties/users. Who wants to think about having to put a "reassembler" app in there? It makes sense from a dev perspective to have it as a separate app but from a user perspective it sucks having to think about it.

petebristow commented 8 years ago

@lukego

#897 is a stab at my StraightNIC + RSS vision. Next is to add multi-queue to the tap driver. From there I'd like to strip down the existing 82599 driver and add RSS support. This would be in conflict with the virtio/vmdq approach but would buy some headroom for pure packet forwarding, so I can see it being a separate driver, perhaps becoming the driver if / when virtio multiqueue arrives, who knows.

Either way, on the face of it, is a refactored 82599 driver likely to get accepted?

lukego commented 8 years ago

I am on board with this vision! The code on #897 looks very good to me. I would be happy to bring in an 82599 driver in this style.

Have you considered writing an X710/XL710 driver rather than 82599 next? This would punt the compatibility issues down the road a little and also give us a working driver for cool new hardware that is abundant in the lab :-) e.g. lugano-1 and lugano-2.

VMDq could be supported in the same way as RSS here, no?

petebristow commented 8 years ago

The problem with the X710 is that I have none of them in my network, but I have a huge number of 82599s that I would like to power with snabb, and those need RSS in place first. I haven't looked into the VMDq facilities; if they can follow a similar scheme to RSS then great.

lukego commented 8 years ago

Cool just checking :).

I am hopeful that RSS and VMDq can work in combination in a simple way but we shall see.

eugeneia commented 8 years ago

Just to chime in here: from the SnabbNFV perspective there is currently one NIC App per MAC-address (VMDq), RSS is not applicable if I understand correctly. If the 82599 were single-app/many-links in VMDq mode instead of “sub-apps” that would be much better for my current concerns in #886. I don't like the multi-app approach of the 82599 because it has all the disadvantages but doesn't actually yield any benefits. Assuming that the same thing that https://github.com/snabbco/snabb/pull/897 does for RSS can be done for VMDq on the other hand makes this approach interesting.

So I see one I/O interface that is 1-N (which I would love to have right now) and one that is 1-1 (which I think might be smarter):

From the SnabbNFV perspective the disadvantage of 1-1 (the first on the diagram) would be possible locking overhead even though everything runs in a single process/thread. On the other hand there is a big advantage in that the SnabbNFV application could be horizontally scaled across cores?

We would end up with interfaces still, but simpler ones:

simpleIf(<pci>) => rx/tx — Can be instantiated once per <pci>, has all traffic
rssIf(<pci>) => rx/tx — Can be instantiated many times per <pci>, has some traffic
vlanIf(<pci>, <vlan>) => rx/tx — Can be instantiated many times per <pci>, has traffic by <vlan>
macIf(<pci>, <mac>) => rx/tx— Can be instantiated many times per <pci>, has traffic by <mac>
...

morphyno commented 6 years ago

Looking at this 2 yrs after this post started: Netdev has made a lot of inroads on this. With tc_offloading also a common driver feature, I feel this could be implemented using tc and bcc and be a complement to snabb.

snabbco / snabb

Straight NIC #801
