p0pr0ck5 / lua-resty-waf

High-performance WAF built on the OpenResty stack
GNU General Public License v3.0

Interesting #35

Closed splitice closed 8 years ago

splitice commented 9 years ago

Hey,

This project looks quite interesting. It's obvious you have put quite a lot of hard work in already :+1: It is also pretty remarkable how many features are available given the size of the code. I'm working on understanding it all currently.

  1. Is there any documentation on which CRS rules are implemented? It might be a good idea to insert a comment above each converted rule with the original rule, to make it easy to see gaps. I am guessing you might be missing some of the features needed to implement the complete ruleset (and, honestly, some of the rules would be quite a performance drain!)
  2. Re the benchmarks conducted: was OpenResty compiled with lua or luajit?
  3. Have you considered using a pre-processor for debug logs? It's fairly common for large systems (e.g. nginx) to be compiled without debug logging for greater performance. The pre-processing could leave the debug log calls in Lua comments, so that production mode runs without processing them.
  4. If you happen to be against this, perhaps consider making debug logging a global property of the worker; then, if it is disabled, overwrite the _log function with an empty function. It would be better than nothing. Still, I personally dislike that complexity and would prefer point 3.
  5. I would do the same with disabling PCRE optimization; it's both a user-support risk and a mostly pointless option outside of debugging.
  6. In Lua, string concatenation is doubly expensive because the hashing is done at that point too; when concatenating a few or more strings, `table.concat` is often faster.
  7. Are there any lists of tests you have performed? I am interested in testing the performance of various scenarios - to make sure there are no resource-exhaustion / DoS attacks lying in wait.
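As an editorial aside, the `table.concat` point (6) can be illustrated with a small, self-contained Lua sketch (the fragment list is purely illustrative):

```lua
-- Illustrative fragments to join
local parts = { "a=", "1", "&b=", "2" }

-- Naive concatenation: every `..` step interns (and hashes) a new
-- intermediate string, giving quadratic work overall
local s = ""
for i = 1, #parts do
    s = s .. parts[i]
end

-- table.concat builds the result once, skipping the intermediates
local buf = {}
for i = 1, #parts do
    buf[#buf + 1] = parts[i]
end
local joined = table.concat(buf)

assert(s == joined)  -- both produce "a=1&b=2"
```

For a handful of short strings the difference is small, but it grows quickly with the number of fragments.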

I will continue reading, but I am interested in getting this up and running - and of course contributing back. Feel free to ignore any questions that seem a bit ignorant; I haven't got it running yet, so I might be mistaken on a point or two.

At a glance this does most of what I need it to do already. And hopefully it will be able to meet our performance requirements (looks to be very very close already).

Thanks in advance for reading all this :)
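(Editorial note: the no-op logger idea in point 4 above could be sketched as below. This is an illustrative sketch, not the project's actual `_log` implementation; `make_logger` is a hypothetical name, and `ngx.log` assumes an OpenResty context.)

```lua
-- Hypothetical sketch: decide once, per worker, whether debugging is
-- enabled, and hand call sites either a real logger or a no-op.
local function noop() end

local function make_logger(debug_enabled)
    if not debug_enabled then
        return noop  -- disabled: call sites pay only an empty call
    end
    return function(msg)
        -- real logger; ngx.log is only available inside OpenResty
        ngx.log(ngx.DEBUG, msg)
    end
end

local _log = make_logger(false)
_log("this costs almost nothing when debugging is off")
```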

p0pr0ck5 commented 9 years ago

Hey, thanks for checking it out! Super exciting to see people interested in this!

  1. There's not an exact one-to-one correspondence between the CRS rules and the included rulesets. In some cases the logic has been changed a little bit; in some cases the ModSecurity ruleset either didn't apply or covers something we don't support (e.g., strict multipart upload parsing, outbound checking). In some situations we don't mimic the logic exactly because we don't have those features (like CSS/JS decoding), or they don't make sense (like ModSecurity rule 960911, which checks for a valid request line; this is an expensive regex, and Nginx wouldn't even process a request that matched this rule - it would return a 400 before hitting the access phase). With that in mind, it might be worth it to better document the existing relationships. I'll look into it!
  2. Benchmarks were run with the most recent stable version of OpenResty (1.7.10), which uses LuaJIT by default.
  3. This would be nice, since building the debug strings is so expensive, but I don't think this type of preprocessing (like C preprocessing) is available within the Nginx Lua module. I think the best we could do to mimic that is wrap the _log() calls in an if statement (instead of having the if check in the function itself). I may also look into that, but it seems like a lot of work for not much gain (I'm a sysadmin, I'm lazy ;)). If you're really looking to squeeze out performance, you can strip out the debug calls entirely:

sed -i '/\s*_log(/d' fw.lua

But I know that's not the best way to do things ;)

  4. I'm not a fan of global variables, and I think configuration like that should be left in the module itself.
  5. Yeah, I don't know how much value is in it; I threw it in during some debugging (see issue #24). I don't think leaving it in will hurt.
  6. Yes, this is true. I'm curious how much the gains of removing string concatenation would be offset by the overhead of table creation. I'll do some testing to see where the threshold is. The debug logs generally don't have more than one or two concatenations per call.
  7. No formal or unit testing yet; it's largely empirical testing done on my end. I was thinking of mocking up some tests using Test::Nginx::Socket, but frankly, that idea bores me (again, I'm a sysadmin, I'm lazy). Contributions welcome if you want to add some testing into the project, or if you do some research on your own and want to share, I would love to hear it :)

Please feel free to poke me with any more questions you have, I'm always happy to chat!

splitice commented 9 years ago

hehe, sysadmin here too. I very much agree with your sentiment; pending your answer, I was planning on just running a command like that to remove it :)

A few tests, even just as a module that can be called, would be a good idea from my perspective, to make sure I don't break too much in the early stages while poking at it. I am happy to write a Travis CI project to CI it.

Currently I am looking into a few areas:

  1. Feature completeness - I need it to be able to do enough of the important rules. And I don't like the huge resource drain that mod_security is - it does too much for what we want to do with it.
  2. Modularity - while it's great in its current standalone state, we need to be able to inject new loggers, actions and storage mechanisms. Preferably without needing to maintain a large patch, since we want to contribute back easily :)
  3. Resources - particularly memory & scalability, but for now I'll start with some low-hanging fruit in CPU performance.

p0pr0ck5 commented 9 years ago

Awesome, if you do end up writing a CI project let me know :)

  1. Feel free to request specific features if you don't see them here. I don't have anything new mapped out in the immediate future so let me know if you're looking for something in particular :)
  2. Saw your pull request. Re-architecting some of the guts to make it more modular seems like it could be worth looking into. Give me a few days to chew on it :)
  3. By far the biggest burden is going to be CPU workloads, because regexes are expensive, which is why we default to cached JIT regex compilation and recommend using a newer PCRE lib. If you haven't yet, you can check out some of my notes I've made over the last few months for performance changes in issue #8. Memory shouldn't be an issue, unless there's an internal memory leak within Nginx itself; generally memory usage should be nearly static throughout the lifetime of the worker.
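(Editorial note: the "cached JIT regex compilation" mentioned in point 3 corresponds to the `o` (compile-once cache) and `j` (PCRE JIT, requiring a JIT-enabled PCRE build) options of the OpenResty `ngx.re` API. A minimal sketch, with an illustrative pattern:)

```lua
-- Inside an OpenResty request handler. "i" = case-insensitive,
-- "j" = enable PCRE JIT, "o" = cache the compiled pattern worker-wide.
local from, to, err = ngx.re.find(ngx.var.request_uri,
    [[\bunion\s+select\b]], "ijo")
if from then
    ngx.log(ngx.WARN, "possible SQLi at offset ", from)
elseif err then
    ngx.log(ngx.ERR, "regex error: ", err)
end
```

Without "o", the pattern is recompiled on every request, which dominates the cost for hot rules.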
splitice commented 9 years ago

Currently I have a lot of ideas, and it's only been a day. My main experience with Lua in nginx is writing a Layer 7 DDoS mitigation environment, so I've had to do a lot of work in the performance area.

Given our environment, we need it to scale to some pretty big extremes: many server blocks (with corresponding enabled and disabled rules), and during some pretty bad days we expect to see many thousands of req/s. We don't wish to take too much CPU & RAM away from the rest of the system should anything leak from mitigation.

A few things come to mind so far -

  1. Separation of rules from request - we would like to be able to define the rules at the init_worker level (engineered to be optional, of course, since not everyone has that configuration requirement). Then each server block only needs to specify a list of included or excluded rules. This also opens up the ability to do some pre-optimization and, more importantly, validation (and less table creation during the request). Loading mostly duplicated rules for 1,000+ location blocks does not appeal to me resource-wise. I am not sure the SHM approach is right here...
  2. HTML entity decoding - looking at the code, your comments and benchmarks, it looks to be quite costly (and I can see why, with stacked string operations) - it's easy enough to do in a C module. An optional drop-in module would be nice, overriding the Lua definition if available. Based on attacks I've seen in the past, HTML decoding is a very important part of WAF rules.
  3. Lots of string & table operations (so much tostring) in your code can be improved or, in some cases, removed.
  4. We need to be able to modularize anywhere there is ngx.exit-based request control - this needs to fit inside an existing system. We just need to wrap it with a guaranteed exit. Fortunately this looks easy enough.
  5. The _table_ methods taking a reference to the main WAF seem a bit backwards. I'd like to look at a code cleanup in that area. Given table creation isn't cheap, it might be possible to reduce the dependency on these functions anyway.
  6. Lots of modularization. Much of this, from what I can see, can be bundled into the system with a small performance bump from getting rid of the double indirection through tables. Not that it's a major optimization. It's fairly easy to do, since really all that needs to be done is to make the tables local to the functions a property of the module. Off the top of my head I am thinking:
    • Modular actions (I want actions like BAN and to be able to feed DROP etc into a ban management system). For the most part its just a matter of exposing the lookup tables as a member of the module so I can add/remove. Nothing complex needed.
    • Modular logging - already got the infrastructure for a multiserver logging system one function call away
  7. I am a bit of a fan of CI, since it's produced great results for us where applied. I haven't ever tried to CI nginx-lua, but it doesn't look too hard. If a Test::Nginx::Socket test was defined I could take it from there. Blatant bribery there - Perl, urgh.
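(Editorial note: the "modular actions" idea in point 6 amounts to exposing the action dispatch table on the module so callers can register handlers. A hypothetical sketch - `register_action`, `ctx`, and the action names here are illustrative, not the project's real API:)

```lua
-- Hypothetical module exposing an overridable action table.
local _M = {}

_M.actions = {
    ACCEPT = function(ctx) ctx.decision = "allow" end,
    DENY   = function(ctx) ctx.decision = "deny" end,
}

-- Callers can register site-specific actions without patching the core
function _M.register_action(name, handler)
    _M.actions[name] = handler
end

-- e.g. feed deny verdicts into an external ban-management system
_M.register_action("BAN", function(ctx)
    ctx.decision = "deny"
    -- push ctx.client_ip to a shared dict or upstream ban list here
end)

return _M
```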

I am starting with some of the easier stuff at the moment (performance-tweaking branch): small wins that don't impact much (function, code quality, LOC etc). I figure that's the easiest way to get a full understanding. I fully expect you to have problems with some of it, so don't worry - just tell me I'm silly if I change something you did on purpose or disagree with; I won't bite.

I fully acknowledge our requirements are not yours, and I am sure not all of it will be what you want, or even entirely the right direction (reverting is learning too). Where possible I'll try to get it to a state that you can merge in, since it's less for me to maintain in the long run as a patch. I am going to put some serious hours in this week & next to try to get a PoC solution for testing internally in our system as a whole.

p0pr0ck5 commented 9 years ago

Sounds like you've got a lot in mind ;) I'll address a few things that popped out:

  1. Not quite sure I understand this one, since modules are loaded globally. Right now each server (or location) block can already specify rules to ignore.
  2. Yes, string manipulation is costly. That's why we cache data transforms per request: if a certain data collection (like REQUEST_ARGS, which is used throughout the XSS/SQLi rulesets) is needed by multiple rules, it's only transformed once - and my comments have reflected the need to cache this data so we don't waste cycles doing the same operations 100 times in a single request. I've found that, during regular processing, the vast majority of CPU time is spent on regex processing, which is handled by the Nginx Lua API, not our own code. I'd be interested in perhaps looking at some optimized transformations in C via FFI; I'm not sure how we could move transformed data from a separate C-level module (my C isn't very good, which is why this is written in Lua :) )
  3. Yes, there's definitely some cleanup that can be done. I've taken a look at some of your patches on your performance improvements, they look promising. I'm looking forward to some outside contribution here :)
  4. Yes, this is something I would be interested in. Right now there's not much attention paid to interacting with other sections of an access phase handler.
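(Editorial note: the per-request transform cache described in point 2 boils down to memoizing each collection/transform pair for the lifetime of one request. A hypothetical sketch - `get_transformed` and the cache-key scheme are illustrative:)

```lua
-- Memoize a transform result under a per-request cache key.
local function get_transformed(cache, key, transform, raw)
    local hit = cache[key]
    if hit ~= nil then
        return hit  -- already transformed earlier in this request
    end
    local value = transform(raw)
    cache[key] = value
    return value
end

local cache = {}  -- would live in the per-request ctx table
local v1 = get_transformed(cache, "uri:lowercase", string.lower, "/Foo")
local v2 = get_transformed(cache, "uri:lowercase", string.lower, "/Foo")
-- the second call returns the cached value without re-running the transform
assert(v1 == "/foo" and v2 == "/foo")
```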

Let me know if you have any questions about the logic or the code itself, I know reading someone else's work isn't always easy (and this has really just been a fun project for me, not serious development work, so I'm sure there are many more people more knowledgeable than I that can contribute).

splitice commented 9 years ago

Progress so far: Requests per second: 6230.39 #/sec
Started with: Requests per second: 5227.52 #/sec

Calculated using ab -k -c 20 -n 100000 http://192.168.56.102/ from the same machine (a virtual machine restricted to a single i7 core)

The biggest single improvement is the offloading of work into multiple contexts. The configuration for which looks like -

init_worker_by_lua '
    local FreeWAF = require "FreeWAF.fw"

    FreeWAF.preload({ 10000, 11000, 20000, 21000, 35000, 40000, 41000, 42000, 90000, 99000 })

    -- setup FreeWAF to deny requests that match a rule
    FreeWAF:set_option("mode", "ACTIVE")

    --fw:set_option("debug", true)
';

server {
    ...

    access_by_lua '
        local FreeWAF = require "FreeWAF.fw"

        -- instantiate a new instance of the module
        local fw = FreeWAF:new()

        -- run the firewall
        fw:exec()
    ';
}

Next I'll look at the rule structure; some wins can be found (i.e. function pointers instead of table lookups for operators) by "pre-compiling" rules when they are preloaded. I'm not sure I'll be able to do it easily without inducing a bit of a hit for non-preloaded rules, however.
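(Editorial note: resolving operator names to function references once at preload time, rather than per request, might look like the following. The operator names and `preload` shape are illustrative, and the REGEX operator assumes an OpenResty context for `ngx.re.find`:)

```lua
-- Hypothetical operator table and preload-time resolution.
local operators = {
    EQUALS   = function(a, b) return a == b end,
    CONTAINS = function(a, b) return string.find(a, b, 1, true) ~= nil end,
    REGEX    = function(a, b) return ngx.re.find(a, b, "jo") ~= nil end,
}

local function preload(rules)
    for _, rule in ipairs(rules) do
        -- swap the string name for a direct function reference once,
        -- so request-time evaluation skips the table lookup
        rule.op_fn = operators[rule.op] or error("unknown op " .. rule.op)
    end
end

local rules = { { op = "CONTAINS", pattern = "select" } }
preload(rules)

-- at request time, evaluation is a single direct call:
local matched = rules[1].op_fn("union select", rules[1].pattern)
assert(matched == true)
```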

splitice commented 9 years ago

Still - there's much room for improvement when actually executing through a more serious number of rules and transformations.

# ab -k -c 20 -n 100000 http://192.168.56.102/?a=aa
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.56.102 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        openresty/1.7.10.1
Server Hostname:        192.168.56.102
Server Port:            80

Document Path:          /?a=aa
Document Length:        612 bytes

Concurrency Level:      20
Time taken for tests:   100.243 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    99001
Total transferred:      85595005 bytes
HTML transferred:       61200000 bytes
Requests per second:    997.58 [#/sec] (mean)
Time per request:       20.049 [ms] (mean)
Time per request:       1.002 [ms] (mean, across all concurrent requests)
Transfer rate:          833.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       3
Processing:     0   20 187.3      1    2571
Waiting:        0   20 187.3      1    2571
Total:          0   20 187.3      1    2571

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      2
  95%      2
  98%     19
  99%     24
 100%   2571 (longest request)
splitice commented 9 years ago

To give you an idea of the potential gains of pre-processing: fd812587e1a41f9c9a80ab2bd8c07d27dcca74b1 increased rule processing performance by 25%. And locally I've got another 25% being worked on :)

Requests per second:    1399.23 [#/sec] (mean)

Pre-processing the entire rule into a Lua function call is my ultimate aim - although it's difficult with the current structure, so I'll focus on the individual components for now.
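(Editorial note: "pre-processing the entire rule into a Lua function call" could mean building one closure per rule at preload time, so the request path is a single call with no table walking. A hypothetical sketch, with illustrative field names:)

```lua
-- "Compile" a rule down to a single closure capturing its parts.
local function compile_rule(rule)
    local transform = rule.transform or function(s) return s end
    local op        = rule.op_fn
    local pattern   = rule.pattern
    return function(input)
        return op(transform(input), pattern)
    end
end

local check = compile_rule({
    transform = string.lower,
    op_fn     = function(a, b) return string.find(a, b, 1, true) ~= nil end,
    pattern   = "select",
})

assert(check("UNION SELECT") == true)  -- lowercased, then matched
```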

splitice commented 9 years ago

Last update for me for the night ee3b781f1c6683d31d13d6abc0f0a8a08705a325

# ab -k -c 20 -n 10000 http://192.168.56.102/?a=aa
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.56.102 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        openresty/1.7.10.1
Server Hostname:        192.168.56.102
Server Port:            80

Document Path:          /?a=aa
Document Length:        612 bytes

Concurrency Level:      20
Time taken for tests:   6.871 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    9901
Total transferred:      8559505 bytes
HTML transferred:       6120000 bytes
Requests per second:    1455.40 [#/sec] (mean)
Time per request:       13.742 [ms] (mean)
Time per request:       0.687 [ms] (mean, across all concurrent requests)
Transfer rate:          1216.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       3
Processing:     0   14 111.8      1    1343
Waiting:        0   14 111.8      1    1343
Total:          0   14 111.8      1    1343

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%     12
  90%     13
  95%     14
  98%     15
  99%     19
 100%   1343 (longest request)
p0pr0ck5 commented 9 years ago

Sorry, been busy with life the last few days (and will be for a few more, but I got a small slice of free time). Some of your branches look interesting; I would be interested to see how these changes compare using tools like https://github.com/openresty/nginx-systemtap-toolkit and https://github.com/openresty/stapxx

splitice commented 9 years ago

So would I, but alas I have never been able to get systemtap / dtrace to execute - it always freezes on me, very similar to what another person experienced on OpenResty's mailing list recently. To be investigated when I have time.

Instead, currently I am using a stopwatch module to time areas of interest, and ApacheBench for benchmarking.
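(Editorial note: a stopwatch of this kind can be as simple as bracketing a section with `ngx.now()`, refreshed via `ngx.update_time()` since OpenResty caches the time per event-loop iteration. A minimal sketch - `timed` is an illustrative helper, not the module actually used here:)

```lua
-- Minimal stopwatch for timing a section inside an OpenResty handler.
local function timed(label, fn, ...)
    ngx.update_time()          -- refresh the cached time
    local start = ngx.now()    -- seconds, with millisecond precision
    local results = { fn(...) }
    ngx.update_time()
    ngx.log(ngx.INFO, label, " took ",
            (ngx.now() - start) * 1000, " ms")
    return unpack(results)
end
```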

p0pr0ck5 commented 9 years ago

If you have a stable branch/commit you would like to submit I can clone it into my test harness and generate a flame graph and lua exec time reports.

splitice commented 9 years ago

My latest commit is https://github.com/splitice/FreeWAF/commit/ee3b781f1c6683d31d13d6abc0f0a8a08705a325

If any issues are identified in a flame graph I would be happy to fix it.

p0pr0ck5 commented 9 years ago

Sorry I haven't responded in a while, I've been very busy with work and outside responsibilities. The amount of time I'll be able to commit to the project in the near future will be diminished significantly. I'll still try to get to profiling your commits at some point soon, and if you have a stable and solid set of changes you'd like to submit I'd be willing to take a look.

p0pr0ck5 commented 8 years ago

Going to close this as stale. There were good ideas presented here, and we can use this as a discussion base in the future.