snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

Buffering in the app network: when to hold and when to drop #656

Open lukego opened 8 years ago

lukego commented 8 years ago

Question: what should be the baseline policy for when an app should hold a packet and when it should drop a packet? This is relevant to the new RawSocket app on #655.

Possibilities:

  1. Don't hold packets between breaths. Process or drop but don't wait.
  2. Local flow control in apps: don't read from your input queue unless your output queue has capacity.
  3. Handle congestion e.g. with an active queue management algorithm.
  4. ... other?
lukego commented 8 years ago

The idea that makes most immediate sense to me would be for the basic behavior of apps to be (1) and for fancy algorithms like (3) to be implemented as their own apps. Flow control can easily have unintended consequences: better to tune the sizes of the buffers and the amount of work that is accepted in each breath so that packets are dropped when capacity is exceeded or links are out of service.

If we adopted that position we would need to eliminate some flow control from our codebase, i.e. cases where an app checks its output capacity before reading packets off its input link. Instead the transmit function would detect the lack of capacity and drop.
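A toy sketch of that position, using plain Lua tables rather than the real core.link structs (names here are illustrative): the transmit function itself detects the full link and drops, so senders never need to check capacity first.

```lua
-- Toy link: a bounded queue with a drop counter. Not the real
-- core.link API; just illustrates "transmit detects and drops".
local function new_link (capacity)
   return { packets = {}, capacity = capacity, txdrop = 0 }
end

local function transmit (l, p)
   if #l.packets >= l.capacity then
      l.txdrop = l.txdrop + 1        -- over capacity: drop, don't wait
   else
      table.insert(l.packets, p)
   end
end
```

An app can then receive and transmit unconditionally; the drop counter makes congestion observable instead of implicit.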

Reasonable people may have a different point of view :).

lukego commented 8 years ago

Have been pondering this. My theory is that the simplest and neatest solution is to never apply flow control within the app network. That is: apps will always receive and process all of their available packets and never check whether the output has capacity beforehand. During congestion packets will be dropped at the bottleneck point.

To put this into practice we would need to do a couple of things:

  1. Put some "burst" limiting on apps that bring packets into the app network. For example limit each app to introducing 100 packets per breath. (The intel10g app is already doing this.)
  2. Increase the size of our buffers so that they only overflow due to actual congestion (packet arrival rate exceeding packet processing rate) and not due to burstiness (e.g. a Join app combining several input links could reasonably overflow its output link or the next app in a single breath). In particular we would need to consider increasing the size of struct link and the NIC transmit descriptor rings to avoid premature overflows.

The benefit of this activity would be more predictable behavior under overload and less code (hallelujah) since apps would always "blindly" receive and transmit and depend on the transmit function to drop packets that are over capacity.
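Point (1) could be sketched roughly like this (illustrative names, not the actual intel10g code): pull() moves at most a fixed burst of packets into the app network per breath, regardless of how many the device has queued.

```lua
-- Burst-limited pull(): at most BURST packets enter the app
-- network per breath, no matter how deep the device queue is.
local BURST = 100

local function pull (device_queue, output)
   local n = 0
   while n < BURST and #device_queue > 0 do
      table.insert(output, table.remove(device_queue, 1))
      n = n + 1
   end
   return n   -- packets introduced this breath
end
```

With every source app bounded like this, link sizes can be chosen so that only genuine congestion, not ordinary burstiness, causes overflow.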

kbara commented 8 years ago

I'd definitely like to see this in practice. In an earlier version of the lwaftr, checking whether two output links were full was a non-trivial slow-down, and it didn't actually come with any real benefits.

eugeneia commented 8 years ago

I also agree with this rationale. I came into contact with link drops during #555, and having a clear definition of what it means for a link to drop packets is important, imho. I added the enhancement label so this issue can be found when “looking for work”.

plajjan commented 8 years ago

My theory is that the simplest and neatest solution is to never apply flow control within the app network. That is: apps will always receive and process all of their available packets and never check whether the output has capacity beforehand. During congestion packets will be dropped at the bottleneck point.

@lukego I don't agree with this, at least if "never" is really "never". I agree that this makes sense for the vast majority of cases, but you have to be able to implement QoS in Snabb, which is why this

The idea that makes most immediate sense to me would be for the basic behavior of apps to be (1) and for fancy algorithms like (3) to be implemented as their own apps.

makes sense.

I imagine anyone deploying Snabb today would do so in an ideal environment, like 10G in and 10G out, so the need for QoS is typically small, but with time this could change. Could Snabb be the forwarding plane of a CPE? Yes, of course (btw, people are hacking to get VPP running on a RPi - kinda cool!). That certainly means a need for QoS.

90%+ of use cases probably require no queueing whatsoever and can live with the simple link overflow you mention when CPU is scarce. Of those that do require QoS, most will likely only need it on the penultimate node, i.e. the node closest to the NIC.

In the VlanMux app I wrote I'm pretty sure there is a HOLB (head-of-line blocking) issue. Similar to this example:

ingress_nic -> app1 -> Tee ->  egress_nic1
                        |
                        +---->  egress_nic2

If you just look at what the Tee app does today, it will get the maximum number of packets it can send based on the output link that has the least capacity available. If egress_nic1 is congested and can take no more packets, that number will be 0. The Tee app will thus not consume its input and pressure will build up backwards through the graph. Your solution of blindly transmitting packets would solve this.
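The difference can be shown with toy links (plain tables, illustrative names, not the real Tee code). The blocking Tee forwards only as many packets as the least-free output can take, so one full output starves the rest; the non-blocking variant always consumes its input and drops only at the full output:

```lua
-- Toy links: plain tables standing in for core.link structs.
local function new_link (capacity)
   return { packets = {}, capacity = capacity }
end

local function free_space (l) return l.capacity - #l.packets end

-- Blocking Tee: forwards min(free space across outputs) packets.
-- One full output means nothing is forwarded at all (HOLB).
local function tee_blocking (input, outputs)
   local n = #input
   for _, o in ipairs(outputs) do n = math.min(n, free_space(o)) end
   for _ = 1, n do
      local p = table.remove(input, 1)
      for _, o in ipairs(outputs) do table.insert(o.packets, p) end
   end
end

-- Non-blocking Tee: always consumes its input; a full output drops
-- its copy while outputs with room still receive traffic.
local function tee_nonblocking (input, outputs)
   while #input > 0 do
      local p = table.remove(input, 1)
      for _, o in ipairs(outputs) do
         if free_space(o) > 0 then table.insert(o.packets, p) end
      end
   end
end
```

With one output full, tee_blocking moves nothing at all, while tee_nonblocking keeps the healthy output flowing and drops only at the bottleneck.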

lukego commented 8 years ago

@plajjan Good input! This is indeed a subtle topic. I will make the case for a fully asynchronous design but I am willing to be convinced otherwise too.

HOLB

Your VlanMux example does seem like a case where you will be burned by head of line blocking as a consequence of the Tee app being blocking (synchronous / using flow control / using back-pressure). If one of the outputs is full then no input will be processed and that will also block the outputs that do have capacity.

I see this as a big problem. I think that providing both blocking and non-blocking apps makes it unreasonably hard for users to predict the behavior of their app networks. On the whole it would be simpler and better for apps to be exclusively non-blocking even if this requires us to write test rigs in a different style. cc @wingo.

QoS

Quality of Service is an interesting example. If we had a QoS app then it would need to do two things: detect how much capacity is available and then use a deliberate policy to choose which packets to drop. Backpressure would be one way to detect overload that might be neat in simple setups (QoS->NIC) but still problematic in complex ones (QoS->Tee->NIC). I would suggest that a simple QoS app could use a statically configured capacity (like RateLimiter) and a fancy QoS app might dynamically measure capacity e.g. via a shm object. In all cases the purpose of the QoS app would be to enforce a policy without buffering and causing delay.
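A statically configured QoS app in the spirit of RateLimiter might be sketched as a token-bucket policer (names and structure are my own sketch, not an existing Snabb app): it enforces a byte rate by dropping the excess immediately rather than buffering and adding delay.

```lua
-- Token-bucket policer sketch: a static capacity is enforced by
-- dropping over-budget packets up front, never by queueing them.
local Policer = {}
Policer.__index = Policer

function Policer.new (rate_bytes_per_s, bucket_bytes)
   return setmetatable({rate = rate_bytes_per_s,
                        bucket = bucket_bytes,
                        tokens = bucket_bytes,
                        dropped = 0}, Policer)
end

function Policer:tick (seconds)   -- refill once per breath
   self.tokens = math.min(self.bucket, self.tokens + self.rate*seconds)
end

function Policer:accept (packet_bytes)
   if self.tokens >= packet_bytes then
      self.tokens = self.tokens - packet_bytes
      return true                        -- forward downstream
   else
      self.dropped = self.dropped + 1    -- enforce policy by dropping
      return false
   end
end
```

A fancier variant could replace the static rate with a capacity measurement shared via a shm object, as suggested above.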

Realistic? Are there more examples that make the case for the synchronous/blocking/backpressure model?

plajjan commented 8 years ago

Just to be clear, my VlanMux doesn't use the Tee app. I just mentioned Tee since it's an example that's in master - VlanMux does however use the same method as Tee to compute the writable capacity of all its outputs (it's copied from there), so they suffer from the same problem.

For certain apps it might make sense to block when an output is full, but VlanMux shouldn't block just because one output is full. Take this topology: why stop forwarding vlan2 traffic just because NIC1 is congested?

+------+    +---------+    vlan1   +------+
| NIC0 | -> | VlanMux | ---------> | NIC1 |
+------+    +----+----+            +------+
                 |         vlan2   +------+
                 +---------------> | NIC2 |
                                   +------+

There are multiple challenges here. The first is that I need to look in the packet to determine whether it goes out on the vlan1 or vlan2 link. AFAIK if I do link.receive(input) then there is no way to put the packet back on the link. I was looking at the Tap driver and noticed it seems to peek the input link, try to send the packet out on the tap interface, and only consume it from the ingress link if the send succeeds. I could perhaps attempt something similar but I think I would quickly run into more problems. Can I peek/read/consume further back than one packet? Maybe, maybe not. Even if I could, and went through all the packets of this breath, I would still run into problems once I've consumed all the vlan2 packets in the breath and the next breath is all vlan1 packets, blocking more vlan2 packets from ever being put on the link - i.e. the back pressure has now effectively moved back to NIC0, where it turns into a "stupid" drop.
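The Tap pattern described above, reduced to a toy form (illustrative names, plain tables): peek at the head of the link, attempt the device send, and only consume the packet when the send succeeds.

```lua
-- Peek-then-consume: a packet is removed from the input only after
-- the device accepted it; on failure it stays at the head of the link.
local function tap_push (input, try_send)
   while #input > 0 do
      local p = input[1]              -- peek without consuming
      if try_send(p) then
         table.remove(input, 1)       -- success: now consume
      else
         break                        -- device full: leave it, stop
      end
   end
end
```

As noted above, this only defers the problem: the pattern cannot skip past a blocked head-of-line packet to reach packets destined for an uncongested output.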

This is very similar to the design of large routers, for example like this:

.-- LC0 --------.
|  +----+       |
|  | NP |----+  |   .---.
|  +----+    |  |   |   |
|            +----->| F |
|  +----+    |  |   |   |    .-- LC2 --------.
|  | NP |----+  |   | A |    |        +----+ |
|  +----+       |   |   |    |   +--->| NP | # port
`---------------'   | B |    |   |    +----+ |
                    |   |---->---+           |
.-- LC1 --------.   | R |    |        +----+ |
|  +----+       |   |   |    |        | NP | |
|  | NP |----+  |   | I |    |        +----+ |
|  +----+    |  |   |   |    `---------------'
|            +----->| C |
|  +----+    |  |   |   |
|  | NP |----+  |   `---'
|  +----+       |
`---------------'

With traffic from multiple NPs on LC0 and LC1 to LC2 you are likely to congest the path from the fabric to LC2. Most boxes solve this through virtual output queueing: the ingress NP looks up the egress NP+port and enqueues the packet in a VOQ for that egress port, at the ingress NP. The LC2 NP can signal back pressure over the fabric to the ingress NPs so they know what to do with the VOQ. There are various issues with this technique but it's usually the most reasonable solution.

A completely output-buffered router would be the ideal, but that requires a completely non-blocking fabric, i.e. the from-fab capacity of every linecard equal to the sum of the to-fab capacities, with the egress NP able to process all of it. That's obviously not feasible in real life, but if we could do it we would have our egress QoS policy implemented in the most convenient location from a policy-enforcement perspective - we would be able to match every incoming packet against the policy and enqueue/drop packets accordingly.

In reality we do the VOQ dance, and when we get back pressure we signal this back to the ingress NPs, which in turn start dropping packets. The problem is that the back pressure / distributed drop is less refined, so we might drop the wrong packets. Imagine we have high-priority VoIP packets coming in on LC0 while LC1 is receiving a DNS amplification DDoS attack. As long as all the packets make it to the egress NP, it will look at DSCP values or similar and prioritise the VoIP packets, so only DDoS packets are dropped. As soon as the from-fab queue is full, we signal back pressure and suddenly the ingress NPs are supposed to drop packets, but they might not do the full classification that the egress NP does, so instead we just drop X% of the traffic, hitting both high-prio VoIP packets and DDoS packets - "stupid drop" again. The actual details naturally depend on the implementation...

The CRS-1 has 2.5:1 speedup in fabric meaning that a linecard with 40G frontplate capacity actually has 100G from the fabric, so even with a DDoS coming in from 2 other cards going out a single card, the egress card can receive those packets and process them. In the CRS-3 Cisco lowered the speedup to something like 1.5. Customers weren't really ready to pay for it...

Blah blah, I'm rambling... that became a lot longer than intended. I think my point is that Snabb might have similarities with a large router: if we have 40x10GE interfaces spread over 2 or 4 CPUs, then each CPU is an ingress NP and the QPI bus is our fabric. It will all become very complex and I think we should try to avoid that whole scenario, at least for now. Stick to the 90% cases and come up with a solution that fixes HOLB for those while still allowing scheduling if that's what we want.

All apps that aren't explicitly some form of QoS app shouldn't have to think about this; they should just put packets on their output links. If we are CPU constrained then we will drop packets, probably in a nondeterministic way, but I think that's ok - it's easy to fix by adding more CPU. Assuming we are not CPU constrained, the only potential congestion point is the egress NIC, and there we can have an explicit QoS app. Naturally it needs to be able to sense congestion on the NIC egress port, and it may be that we want to do this differently than sensing backpressure via a link. If the link queue size is too big then it's impossible to do low-latency queueing in software.

How does it work today? What's the queue depth on a link? What's the size of a breath? How are apps called to process packets, like does it follow the graph?

plajjan commented 8 years ago

@lukego also for the QoS part, the static value is difficult to implement in reality since it's difficult to know just what that value should be. If you put it "too low" you will waste capacity and your users will be unhappy with the disconnect between ingress capacity and egress capacity. Too high a value and you will queue in the NIC instead, which completely defeats the purpose. VLAN tagging changes the value. The Ethernet clock is not stable, which also affects this, in particular on a link running congested over long periods of time.

eugeneia commented 8 years ago

I previously had no concrete preference regarding this topic, but after being exposed to it in a couple more specific areas I think I can formulate and argue for a clear design. In my favored design every dropped packet is dropped explicitly. Whether an app is lossy (“asynchronous”) or lossless (“synchronous”) is part of the app's internal logic and documented API. In brief:

In a way this design is already reality:

I say obey the pattern that emerged naturally and do away with the wildcard feature that is link drops. Where to drop packets should be a problem solved by application developers, not by the framework. I suggest we provide the tools to solve the problem on an application-by-application basis (the link API) and avoid forcing one shoe onto every foot (links dropping packets).

plajjan commented 8 years ago

@eugeneia you might be right that this should be done explicitly in the app. What I want is simplicity and deterministic behaviour which is my issue with the current incarnation of the Tee app. @lukego mentions he prefers non-blocking apps. Is the solution to simply drop packets if there isn't enough space on the output links?

What reasons are there for a link being full? What's the queue depth of a link? What is the normal size of a breath?

lukego commented 8 years ago

What reasons are there for a link being full? What's the queue depth of a link? What is the normal size of a breath?

My suggestion is that the engine regulates the rate of incoming packets to around 100 per breath, links can hold up to 1,000 packets (10:1 safety margin), and links are expected to overflow under a diverse range of exceptional circumstances but not in normal operation.

Here are some examples of where a link might fill up and cause packets to drop:

I hope that this behavior would be predictable and graceful in overload situations. That is: use the capacity you have, drop the excess without introducing delay, make the situation observable (via counters) to the people responsible for capacity planning in the network.

To be a little more vivid: imagine you are an ISP experiencing extreme load e.g. your customers are all streaming the world cup in HD for the first time, you are being DDoS'd, and a data center outage has wiped out half of your capacity. This will push all of your network elements beyond their specifications. How should they cope? I would hope that they focus on using the capacity they have as efficiently as possible and providing quantitative information about how load is impacting their operation. This is not the time you want to be triggering bugs and edge-cases in exotic code paths.

@eugeneia I am not immediately sure how to take a step towards common ground. I agree with your sentiment but not the details. If a system is overloaded then the last thing I want it to do is report assertion failures and restart. Mixing lossy and lossless modes is complex, like mixing eager and lazy evaluation in a programming language, and risks kicking this problem down the road until it surprises somebody in production. On the internet the default response to overload/congestion is to drop the excess as quickly as possible and then continue: an app having something clever to do (apply QoS rules; send a TCP Explicit Congestion Notification; send an ICMP Source Quench) will be the exception rather than the rule. Saying the current Snabb behavior "emerged naturally" seems an overly generous way of saying that the person writing the code used the first idea that came to mind. The asynchronous networking model is the one on the right side of history, i.e. the simple and predictable design that keeps killing off more complex alternatives (and their successors), judging by how the internet has displaced competitors like traditional telco networks.

Could be that my views are too extreme. I would be curious to know @alexandergall and @sleinen's perspectives on this issue.

lukego commented 8 years ago

Reflection: This Issue should perhaps have been a Pull Request from the beginning so that we would be talking about concrete code changes and could reach a conclusion with a merge or refusal to merge.

plajjan commented 8 years ago

@lukego maybe we have completely different philosophies here but I prefer to separate the problem statement (issue) from potential solutions (PRs), which allows us to write many of the latter without discarding the former.

plajjan commented 8 years ago

When does app:push() get called? Exactly once during a breath? In what order? Does that mean you can never build a program with appFoo1..11 (i.e. 1100 packets) sending packets to appBar without dropping packets? I guess push() is run-to-completion so at least it can process its whole batch.

Would it make sense to schedule appBar:push() once we see more than X packets on its link?

lukego commented 8 years ago

@plajjan Today there is a tiny little breathe() function that implements this logic:

  1. Call pull() on every app that defines this method. That is, every input source can bring in packets on each breath.
  2. Call push() on every app that received new traffic on an input link.
  3. Repeat step 2 until idle.

The original implementation - still mostly in place - is that on pull() each app tries to fill up its output links, up to a maximum of 256 packets per link. However, the intel_app has an artificial limit of bursting at most 128 packets per breath. Empirically this seems to be a sweet spot, and I think we should generalize the mechanism to control the total number of packets that we pull() into each breath. There are a few ways that could be implemented; I am not sure which is best.
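The three steps might be rendered in miniature like this (a deliberate simplification; the real engine.breathe() has more machinery, and the app/link mocks here are illustrative):

```lua
-- Toy breathe(): pull every source once, then push apps that have
-- pending input until no app makes progress (i.e. the network is idle).
local function breathe (apps)
   for _, app in ipairs(apps) do
      if app.pull then app:pull() end            -- step 1: inhale
   end
   repeat
      local progress = false
      for _, app in ipairs(apps) do
         if app.push and app:has_input() then
            app:push()                           -- step 2: process
            progress = true
         end
      end
   until not progress                            -- step 3: until idle
end
```

The burst limit discussed above is what bounds step 1, so the repeat loop in steps 2-3 always terminates with a bounded amount of work per breath.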

plajjan commented 8 years ago

@lukego ah, I see, so push() will run repetitively until app network is empty. Makes sense.

Overrunning a link buffer is probably quite exotic, but if it's not, then I suppose one would want to preempt and run the apps with the fullest input links before others. A naive approach is simple to implement but there are pitfalls (e.g. you might want to start processing from the egress side of the app network), so the current scheme is probably best kept until we have proven use cases causing problems.

petebristow commented 8 years ago

Is it particularly meaningful to know which app is dropping packets due to overload? Presumably the back pressure is being caused by an exit point from the app network, be it an IPC link / NIC / Tap interface, and you could just track it from there. The more overloaded the system, the closer the 'dropping' app will be to the source of the packets. If a ratelimiter app is dropping packets because you're trying to send 11meg down a 10meg cap then that's meaningful, but tracking that is something internal to the ratelimiter app.

@plajjan I don't think push() is run until the app network is empty. If an app stalled because its output link was full, it wouldn't get run again if its output link subsequently emptied whilst push() was being run on the rest of the app network, unless one of its input links saw more traffic.

The contract between breathe and the apps in the network is a little bit hazy. @lukego My reading of breathe suggests you missed a step 1a: call push() on every app, regardless of traffic on the input link. Repeater wouldn't work without this.

The current 'contract' is defined as

— Method myapp:pull Optional. Pull packets into the network. For example: pull packets from a network adapter into the app network by transmitting them to output ports.

— Method myapp:push Optional. Push packets through the system. For example: move packets from input ports to output ports or to a network adapter.

What if we amended this contract?

— Method myapp:pull Optional. Pull packets into the network; this is the only way an unbounded number of packets can enter a Snabb network. For example: pull packets from a network adapter into the app network by transmitting them to output ports.

—Method myapp:push Optional. Push packets through the system.

  1. The number of new packets that a push() can introduce into the network must be bounded by some function of the number of packets received on an input link. Generating a packet in response to an ICMP echo request is fine; an unbounded loop like while true do transmit(o, packet.clone(self.packet)) end is not.
  2. A push() should not transmit() packets to a full output link unless they are leaving the app network; NICs / IPC links can drop packets when their buffers are full.
  3. After a sufficient number of push() calls all input links to the app should be empty.
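A push() obeying these rules might look like this at the network's edge (toy tables, hypothetical names): it always drains its input (rule 3) and only drops because the packets are leaving the app network through a bounded device ring (rule 2).

```lua
-- Edge push(): drain the input completely; packets that do not fit
-- in the device's transmit ring are dropped and counted at the edge.
local function nic_push (input, txring, ring_size, counters)
   while #input > 0 do                        -- rule 3: drain the input
      local p = table.remove(input, 1)
      if #txring < ring_size then
         table.insert(txring, p)              -- hand to the device
      else
         counters.drop = counters.drop + 1    -- rule 2: edge may drop
      end
   end
end
```

Interior apps would follow the same shape minus the drop branch, since under this contract only exits from the app network are allowed to discard packets.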
eugeneia commented 8 years ago

My suggestion is that the engine regulates the rate of incoming packets to around 100 per breath, links can hold up to 1,000 packets (10:1 safety margin), and links are expected to overflow under a diverse range of exceptional circumstances but not in normal operation.

@lukego You have convinced me of this strategy. I had thought of an app network as a black box, but given the generous safety margin I think the behavior will be the same for the simple cases I had in mind, and in the exceptional cases the app network state will reflect the real world and give insight into the problem. E.g. following my suggestion, in an extreme case the packets would be dropped at the front or the end of the app network without really knowing which app/link is the culprit, while with your suggestion it would be visible which parts of the app network cannot account for the load.

So in addition to increasing link sizes, I think we should remove link.full, i.e. get rid of the widely used pattern of transmitting only when there is room in the output link, and refrain from allowing apps to be synchronous/blocking.

plajjan commented 8 years ago

@eugeneia should we invent a new mechanism to know whether a NIC is congested first? I guess link.full() together with relying on back pressure on the link to the NIC is the only method we have today, right?

I think this is a case of leading by example, and where the model of keeping all apps in one repo is a good one. Just rewrite everything to not use full() and I'll bet we won't see any new uses of it :)

eugeneia commented 8 years ago

What if the push method of peripherals were blocking? This kind of goes against the asynchronous principle, but is there a better way to deal with this scenario? Is there ever a case where you want to drop packets at the app network's ends? Why generate a packet if it will be dropped?

eugeneia commented 8 years ago

See #950 for an implementation of the asynchronous behavior described in https://github.com/snabbco/snabb/issues/656#issuecomment-209754177