snabbco / snabb

Snabb: Simple and fast packet networking

Firehose -> fireblock for DDoS mitigation? #564

Open plajjan opened 9 years ago

plajjan commented 9 years ago

As discussed on twitter, I'm pondering whether the firehose program can be tweaked to fit my use case.

A while back I wrote an app for DDoS mitigation, see https://github.com/plajjan/snabbswitch/tree/ddos-rules/src/apps/ddostop. The outline of the logic is described in the README but I'll give you the essence for the sake of this discussion.

My app looks at packets, matches them against PCAP/BPF filters and calculates pps/bps rates per source IP address. If a rate goes above a given threshold we start blocking that source IP. My implementation is inspired by what Arbor calls "zombie detection", which I've found to be one of the most useful types of mitigation. It's simple and very flexible.

There are more or less two states that a source IP can be in: either it's blocked or it's not. If a source is in the blocked state we only dig out the src IP and then throw away the packet. You can look in my source code: I specifically avoid Snabb's standard Ethernet and IP header parsing since it was faster to implement my own crude version that only looks at the Ethernet header to determine whether it's IPv4 or IPv6 and then reads the source address at a fixed offset. This bit is quite fast and I've seen something like 7-8Mpps in the real world on a fairly old CPU.
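
Roughly, the crude parse looks like this - a simplified sketch rather than the exact ddostop code, assuming an untagged Ethernet frame:

```lua
local ffi = require("ffi")
local bit = require("bit")

local ETHERTYPE_IPV4 = 0x0800
local ETHERTYPE_IPV6 = 0x86DD

-- p is a uint8_t* pointing at the first byte of the Ethernet frame.
local function source_ip(p)
   local ethertype = bit.bor(bit.lshift(p[12], 8), p[13])
   if ethertype == ETHERTYPE_IPV4 then
      return ffi.string(p + 26, 4)   -- IPv4 source: 14 (Eth) + 12 = offset 26, 4 bytes
   elseif ethertype == ETHERTYPE_IPV6 then
      return ffi.string(p + 22, 16)  -- IPv6 source: 14 (Eth) + 8 = offset 22, 16 bytes
   end
   return nil                        -- not IP: leave it to the normal path
end
```

Returning the raw address bytes as a short Lua string means the blocked-source lookup is a single table access.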

If the source is not blocked we match the packet against the PCAP filters. This is much more expensive and IIRC I got something like 1-2Mpps out of the same box.

When under attack, most traffic is bad traffic (duh), so as soon as we've counted a few packets and seen sources go over the threshold we can block them and thus enter the "fast path".

I'm wondering whether the firehose program could somehow speed up my block stage / "fast path" - let's call it "fireblock"! Fireblock would need to look at the Ethernet header to determine v4/v6, retrieve the source IP and match it against some fast filter (bloom + something???). If we have a match it drops the packet. If there is no match the packet needs to be sent for further processing, which could stay in Snabb, so I would need to pass packets from fireblock to Snabb for final processing.
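
Something like this is what I have in mind for the block stage, purely as a sketch: a plain Lua table stands in for the fast filter (a bloom filter could replace it) and source_ip() is the helper from the parse sketch above.

```lua
local link   = require("core.link")
local packet = require("core.packet")

Fireblock = {}
Fireblock.__index = Fireblock

function Fireblock:new()
   return setmetatable({ blocked = {} }, Fireblock)
end

function Fireblock:push()
   local input, output = self.input.input, self.output.output
   while not link.empty(input) do
      local p = link.receive(input)
      -- source_ip() is the crude parser sketched earlier.
      local src = source_ip(p.data)
      if src and self.blocked[src] then
         packet.free(p)             -- blocked source: drop on the "fast path"
      else
         link.transmit(output, p)   -- unknown source: pass on for PCAP matching
      end
   end
end
```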

Further, since my app in Snabb makes the decision on whether or not to block a host, I would need some form of IPC from Snabb to fireblock to signal the list of blocked source IPs.
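
The simplest form of IPC I can think of, just as an illustration and not something firehose provides: the Snabb side rewrites a small block-list file whenever its decisions change and fireblock reloads it periodically; the atomic rename means the reader never sees a half-written list.

```lua
-- Writer side (the Snabb app that decides who to block), one textual
-- source address per line.
local function write_blocklist(path, blocked)
   local f = assert(io.open(path .. ".tmp", "w"))
   for src in pairs(blocked) do
      f:write(src, "\n")
   end
   f:close()
   assert(os.rename(path .. ".tmp", path))   -- atomic swap on the same filesystem
end

-- Reader side (fireblock), called every now and then from its main loop.
local function read_blocklist(path)
   local blocked = {}
   local f = io.open(path)
   if f then
      for src in f:lines() do blocked[src] = true end
      f:close()
   end
   return blocked
end
```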

Since "slow path" (i.e. PCAP matching) stays in Snabb I expect it to continue to be slow but that is okay. Again, most traffic during an attack will be blocked so just speeding up the block stage will give tremendous improvement to the system as a whole.

What do you think?

lukego commented 9 years ago

Great topic, Kristian. I have pondered a bit and I actually think there are multiple good ways we could bring the firehose-style optimization into the app network.

One would be to have a firehose app that takes a callback function (C or Lua) and uses that to decide whether to output a packet vs. leave the buffer in the NIC RX ring for reuse.

Related ideas could be to optimize the hell out of the "slow path" in general or even add the callback capability described above to the standard Intel10G app.

I think we could do all of these things using standard apps and without needing the firehose program front-end. That could be left as a special-purpose interface for people who are not interested in the app network and only want a simple way to deliver packets to a C function.

btw did you try the pcap filtering since we merged pflua? Alex Gall saw a big speedup compared with the older implementation that called out to libpcap.

plajjan commented 9 years ago

The most convenient thing would obviously be if we could simply take parts of firehose and integrate them into Snabb, yielding an overall performance boost, but I suppose the flexibility provided by Snabb will never perform the way a simple program like firehose does. For the time being I'm happy if I can reach line-rate 10G per core (obviously my demand will go up as I move to higher NIC speeds, and as long as the Snabb process model stays).

If you have a callback in the standard Intel10G app, how would that look? How would that work compared to the app network of today? Like a pre-processing hook? So I could pass the packet from the hook onto the regular app network? Most of my packets would just be discarded in the hook.

I'd be happy to optimise the slow path but I don't really know how. I would like to keep PCAP matching and short of removing that I'm not sure how to speed things up. Suggestions are very welcome!

As for testing on the latest code, I just ran a test, unfortunately with a decrease in performance.

Old:

source sent: 31745
repeater sent: 110386950
sink received: 68
Effective rate: 11038695.0

New:

source sent: 31745
repeater sent: 75812520
sink received: 68
Effective rate: 7581252.0

This is run on my laptop with the selftest, so no real hardware is involved.

BTW, if you want to run the selftest on my app, don't be discouraged by the "test failed". It runs two tests: one that tests the logic and one for performance. It is the logic test that fails, because it expects the source to be blocked after 20 packets but for some reason it starts blocking at 21 packets. Not a big deal - the app works after all.

plajjan commented 9 years ago

I was a tad naive in thinking that pflua was a drop-in replacement for the older pcap lib. I realised that wasn't the case, so I updated my code, but performance is pretty much on par :/

To test my slow path I simply write a rule that doesn't match any of the packets in my test pcap, so that I stress the packet matching.
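
Concretely, the stress test is along these lines, assuming pflua's standalone pf.compile_filter API (the exact module path inside Snabb may differ); the filter expression is just something that nothing in the test pcap will match:

```lua
local pf = require("pf")

-- Nothing in the test pcap uses this port, so every packet goes through
-- the full matching code and none are accepted.
local miss = pf.compile_filter("tcp port 9999")

local function run_filter(packets)
   local matched = 0
   for _, p in ipairs(packets) do
      if miss(p.data, p.length) then matched = matched + 1 end
   end
   return matched   -- expected to stay at 0; we only care how long this took
end
```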

Old (lib.pcap):

Effective rate: 737514.7

New (pflua):

Effective rate: 713727.8

lukego commented 9 years ago

I reckon we could find a suitable way to have a preprocessing hook with the same performance that Pavel is seeing with firehose. The question is whether it should be a new app or a feature of the standard driver.

It would also be neat to create a simple-but-realistic snabbmark benchmark to represent your application. Then we could track its performance for optimization and to avoid regressions. I wonder if in the future we will have 20 small benchmarks in snabbmark, each demonstrating a recommended programming style and reference performance for a certain kind of application? That could be neat.

One step at a time...

plajjan commented 9 years ago

When you say "a new app", do you mean like a standard app that we have today? So I could do intel10g.rx -> fireblock.input, fireblock.output -> ddos.input ...? That would be sweet, above all because it fits the current standard model.
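
I guess in config terms it would look something like this, where Intel10G, Fireblock and DDoSTop are just placeholders for the real app classes and port names (requires omitted):

```lua
local config = require("core.config")
local engine = require("core.app")

local c = config.new()
config.app(c, "nic",       Intel10G,  { pciaddr = "0000:01:00.0" })  -- placeholder driver class
config.app(c, "fireblock", Fireblock)                                -- the fast block stage
config.app(c, "ddos",      DDoSTop)                                  -- existing slow-path app
config.link(c, "nic.rx -> fireblock.input")
config.link(c, "fireblock.output -> ddos.input")
engine.configure(c)
```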

A callback in the driver is a bit hackish and I think it should be avoided if possible. If the performance gain is big enough though, hacks like this are still worth the downside of added complexity and some ugliness ;)

Could you elaborate further on what you were thinking about for snabbmark? Something like the basic1, where we hook up source -> ddostop -> sink and measure? Or do you mean to write more explicit components to exercise pcap matching (slow path) / src block (fast path)?

ddostop includes performance testing in its selftest. I'm not sure if there are rules for what a selftest should contain, like only logic testing or performance as well. Is it generally desirable to put performance tests in snabbmark rather than in an app's selftest?

plajjan commented 9 years ago

Speaking of 'optimize the hell out of the "slow path"', I just found a major and silly cause of my slow path being so slow: pairs() vs ipairs().

I'm afraid I'll have to come up with a new name for the "slow path" now, as performance went from 750kpps to 8-9Mpps.

See https://github.com/plajjan/snabbswitch/commit/0913bbe8a86c39e728e511218dfe6ecaca167aea

@lukego if we ever have a set of recommendations on how to write fast code, this should probably be in there. It might be obvious to someone with a Lua background, but for a novice like myself it's less than obvious and worth pointing out. It's perfectly natural once you read the documentation, but I just hadn't reflected on there being both a pairs() and an ipairs().
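
For that guide's sake, the difference boils down to something like this: as far as I understand it, LuaJIT (at least the versions we use) cannot JIT-compile the generic next()-based iteration behind pairs(), so a hot per-packet loop written that way keeps falling back to the interpreter, while ipairs() over the array part stays on a compiled trace.

```lua
local rules = { "rule-a", "rule-b", "rule-c" }   -- stand-ins for compiled filters

-- Slow: the pairs() loop is not compiled by the JIT, so it runs
-- interpreted for every packet.
local n = 0
for _, rule in pairs(rules) do
   n = n + 1   -- stand-in for matching the packet against rule
end

-- Fast: with the rules kept in the array part, ipairs() (or a plain
-- numeric for) stays on a compiled trace.
for _, rule in ipairs(rules) do
   n = n + 1
end
```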

lukego commented 9 years ago

Yeah. Welcome to my and @alexandergall's world :-). Optimizing the Lua code is still a black art, but we are getting incrementally better at it. I would like to build up a good collection of reference examples (so that you always know approximately what performance you should be getting) and better diagnostics, so that it is much easier to find these tiny details that annoy the JIT.

I am optimistic that we will make this process much easier but it will take some time.

Congratulations on your successful optimization campaign now anyway :-)

lukego commented 9 years ago

@plajjan btw check out the LuaJIT 2 Optimization Guide.

The trouble for me is that I hate taking optimization advice that I don't understand and it is taking me some time to really understand what is going on under the hood... probably everything they say is right though :).

plajjan commented 9 years ago

@lukego another interesting change, credits to Daniel Barney - https://github.com/plajjan/snabbswitch/commit/d9d38ba82e4ef4bc5d25008f539e7bfbd10b6c81

plajjan commented 9 years ago

I found another weak spot that I wasn't really expecting. I've been using PcapReader -> Repeater for my tests, and the pcap file I used as input was a 2MB file with 32k packets in it. Most of the packets were the same (5 ICMP packets and then 31995 identical NTP packets), so for my test I extracted one NTP packet and now just loop that instead. The performance selftest went from ~10Mpps to closer to ~29Mpps. Apparently I was spending too much time just cycling through test packets.

Again, this is on my laptop. Interestingly, my laptop seems to be faster than chur; I didn't really expect that. It's an i7 and I bet it has that single-core frequency boost when nothing else is running, which probably kicks in and helps me out. I will do a test on chur, sending packets over real NICs, to verify performance.

Guess this is another tip for that perf guide we should write ;)

lukego commented 9 years ago

@plajjan Interesting :). Yes, I am also really interested in turning optimization of our Lua code from an art into a science. This will take some thought :). Good to be gathering more experience in the community.

Indeed, chur is a pretty slow machine: 2.0 GHz Sandy Bridge. Fast-clocked i7 CPUs do seem to really beat it. Generally the strength of the higher-end Xeon E5 chips is having many cores, and if you don't need that you are probably much better off with an i7.

I am really curious to see the Skylake CPUs hit the market in the coming months and to see what new SIMD capabilities each model gets from the AVX-512 feature family.