owasp-modsecurity / ModSecurity

ModSecurity is an open source, cross platform web application firewall (WAF) engine for Apache, IIS and Nginx. It has a robust event-based programming language which provides protection from a range of attacks against web applications and allows for HTTP traffic monitoring, logging and real-time analysis.
https://www.modsecurity.org
Apache License 2.0
7.82k stars 1.56k forks source link

Support for RE2 regular expression engine #1996

Open WGH- opened 5 years ago

WGH- commented 5 years ago

RE2 (https://github.com/google/re2) is a regular expression engine written in C++ and developed by Google.

Unlike libpcre, RE2 runtime is always linear to the size of the input. It comes at some cost: the engine inherently doesn't support certain features, like backreferences and lookaround assertions.

Still, even the CRS has some regular expressions in its rules that run very slowly (minutes) on certain inputs, but are supported by RE2 and run much faster (seconds) on the same inputs.

The idea of adding RE2 support to ModSecurity has been floating around for a while, and there's quite a few mentions of people working on things related to it:

I have been unable to find neither issue nor public pull request/fork/branch of ModSecurity with RE2 support.

I have a more-or-less working proof-of-concept patch for ModSecurity with optional RE2 support (with fallback to libpcre if the regexp fails to compile on RE2).

Is someone else working on this? Because if yes, it's probably not worth it to duplicate efforts. If not, I think I can clean up my patch a bit and submit it here as a PR.

zimmerle commented 5 years ago

Hi @WGH-,

Thinking in that possibility, in ModSecurity v3 we have this as plug able interface. So that we could support provide for different regular expression engines. It just demands a single file change, here:

https://github.com/SpiderLabs/ModSecurity/blob/3c1fba278c14fe9b63cff80a3ae32df82ba042ac/src/utils/regex.cc#L41-L129

Ideally the regular expression back-end could be selected with a compilation flag, selected during the configuration.

Hyperscan can be also taken into consideration.

WGH- commented 5 years ago

It seems nobody is sitting on an unpublished patch (or at least doesn't admit to), so I'll try to clean up and submit mine this week.

zimmerle commented 5 years ago

Great to hear that you have a patch already. Let us known if you need further assistance.

zimmerle commented 5 years ago

@WGH- starts the work in that feature with an excellent patch submitted at #2003. Patch is already merged.

zimmerle commented 5 years ago

Hi, @WGH-,

As commented on your patch review, I've made a branch with a couple of suggestions on a refactoring, to make our code a little bit more elegant. Also, I've made some suggestions on how to tackle the architecture to support multiples regex engines. Instead of write about I guess it was faster just to had, it implemented. The code is available here: v3/dev/re2.

As of now, I think it is Ok to have the selection on compilation time. Further, we may have a run-time configuration specifically for that.

This is the output of my configure:

ModSecurity - v3.0.3-43-gcd1a8825 for Linux

 Mandatory dependencies
   + libInjection                                  ....v3.9.2-30-gbf234eb
   + SecLang tests                                 ....e6b03e4

 Optional dependencies
   + GeoIP/MaxMind                                 ....found 
      * (MaxMind) v1.3.2
         -lmaxminddb , -DWITH_MAXMIND 
      * (GeoIP) v1.6.12
         -lGeoIP , -I/usr/include/ 
   + LibCURL                                       ....found v7.63.0 
      -lcurl,  -DWITH_CURL_SSLVERSION_TLSv1_2 -DWITH_CURL
   + YAJL                                          ....found v2.1.0
      -lyajl , -DWITH_YAJL -I/usr/include/yajl 
   + LMDB                                          ....disabled
   + LibXML2                                       ....found v2.9.8
      -lxml2 -lz -llzma -licui18n -licuuc -licudata -lm -ldl, -I/usr/include/libxml2 -DWITH_LIBXML2
   + SSDEEP                                        ....found 
      -lfuzzy -L/usr/lib/, -DWITH_SSDEEP -I/usr/include
   + LUA                                           ....found v503
      -llua5.3 -L/usr/lib/, -DWITH_LUA -I/usr/include
   + Regular expression engine
      * RE2 (experimental)                         ....found v0.0.0
         -lre2 , -DWITH_RE2 -std=c++11 -pthread 
      * PCRE                           [selected]  ....found v8.42
         -lpcre,  -DWITH_PCRE -DPCRE_HAVE_JIT

 Other Options
   + Test Utilities                                ....enabled
   + SecDebugLog                                   ....enabled
   + afl fuzzer                                    ....disabled
   + library examples                              ....enabled
   + Building parser                               ....disabled
   + Treating pm operations as critical section    ....disabled

Notice the [selected] which appears by default on the pcre, in case the user uses --without-pcre it will select RE2. I am not sure if that is the best approach, let me known what do you think about it.

This branch is just for reference. If you have something else in mind, please share.

If you want to discuss something, my Skype is felipezimmerlebr. Hangouts: felipe@zimmerle.org.

I am interested in doing a blog post with the performance comparison in between the three options: re2, pcre and HyperScan, it will be amazing if you want to do it together.

zimmerle commented 5 years ago

The verify_cc implementation is not using the regex encapsulation:

https://github.com/SpiderLabs/ModSecurity/blob/1f5e168370af001678ffb450026566bd3c9492fa/src/operators/verify_cc.cc#L130

That will be in our way.

WGH- commented 5 years ago

I had slightly different idea.

Since RE2 doesn't support certain constructs, it might make sense to do a run-time fallback instead. That is, if RE2 support is built, every regular expression is first attempted to be compiled using RE2, and if that fails, it's compiled with libpcre instead.

This way RE2 can be enabled even if some rules are not supported by RE2, and at the same time, rules that are supported by RE2 enjoy better worst-case speed.

zimmerle commented 5 years ago

Is that any specific ModSecurity-wise regular expression that concerns you regarding functionality? As far as I understand the syntax may be a little different in some corner cases, but both can accomplish the same mission. That said, we may not need a fallback. In case the rules are not syntax-wise supported and the user wants to use re2, he can always change the rule. Unless some functionality is indeed missing.

The thing that concerns me the most about this run time thing is the usability. I don't want to the user to read a manual to understand how regular expression works in modsec. It should be simple and natural.

If fallback is the way that we want to go, we have to have a compilation option to let the user choose the priority list: (rp2, hyperscan, pcre); ...; (hyperscan, re2, pcre). We may support different backends in the future (not only re2 and pcre) and we want the user to freely choose the back end that he opts to, fallback or not.

Another important thing to consider is that the dependency needs be optional. If re2 is not present on the platform, we have to relay on pcre and vice versa. As ModSec internally uses regex, at least one backend should be supported.

Most importantly, if it is runtime, it has to be clear to the user (somehow: debug, error log, warning while loading the rules) which was the engine that was picked to the evaluation, allowing the user to debug and investigate further issues in the regular expression. Otherwise, it will be very difficult to debug the regular expressions.

The way I see is that the runtime implementation is an amelioration of the compile-time option, at least given the architecture presented. Nothing has te bo structural changed to support it, just new code has to be added.

What do you think?

WGH- commented 5 years ago

As far as I understand the syntax may be a little different in some corner cases, but both can accomplish the same mission. <...> In case the rules are not syntax-wise supported and the user wants to use re2, he can always change the rule. Unless some functionality is indeed missing.

As I said in my first message, some functionality is indeed missing. There's extensive reference of RE2 syntax here - https://github.com/google/re2/wiki/Syntax - and it also includes features found in other regular expression engines, but not supported by RE2.

Some of them are purely syntantic ones (like comments, numbered and named (at the same time) groups, etc.), but some features cannot be replaced in general case.

The pull request I mentioned - SpiderLabs/owasp-modsecurity-crs#1255 - marks the regexes not supported by RE2.

For example, <\?(?!xml). This regexp matches <? if it's not followed by xml. Can it be rewritten in RE2? Maybe, for example, like this: <\?(($|[^x])|x([^m]|$)|xm([^l]|$)). The expression becomes is much harder to read, and might be even slower in RE2 than the original negative lookaround version in libpcre (not to mention I'm not 100% sure it's correct, either).

And how, for example, would you rewrite this: (?<!support)@example\.com?

Althought there're no regular expressions with backreferences in the CRS, they might exist in custom rule sets, and would be much more difficult to rewrite, which is also impossible in general case.

My main point is forcing the user to rewrite all his rules to be RE2-compatible is a bad decision, given difficulty of such feat. Very few people would be able to flip the RE2 switch. Retaining libpcre fallback, on the other hand, seems like the best of two worlds to me.

The thing that concerns me the most about this run time thing is the usability. I don't want to the user to read a manual to understand how regular expression works in modsec. It should be simple and natural.

A good thing about RE2 is that if regular expression was compiled successfully, then it will work like it did in libpcre.

If fallback is the way that we want to go, we have to have a compilation option to let the user choose the priority list: (rp2, hyperscan, pcre); ...; (hyperscan, re2, pcre). We may support different backends in the future (not only re2 and pcre) and we want the user to freely choose the back end that he opts to, fallback or not.

It makes sense. Since libpcre supports most features, putting it first would effectively disable all the others, though.