owasp-modsecurity / ModSecurity

ModSecurity is an open source, cross platform web application firewall (WAF) engine for Apache, IIS and Nginx. It has a robust event-based programming language which provides protection from a range of attacks against web applications and allows for HTTP traffic monitoring, logging and real-time analysis.
https://www.modsecurity.org
Apache License 2.0
7.7k stars 1.54k forks source link

NginX binary with modsecurity consume all memory and crashes #1563

Closed intelbg closed 6 years ago

intelbg commented 6 years ago

Hello again and sorry for the inconvenience, I am using libmodsecurity V3 (nginx_refactoring branch) + nginx connector with CRSv3 and when I put this binary of nginx in production with around 150 -200 vhosts, after I make even service nginx configtest (or restart) the server is on 150-200 load immediately (strace show that there are a lot of wait4 sys calls) and the process takes around 20-30 seconds. I tried to remove the CRS rules folder to see if the problem is in the number of lines/files or in the binary but no big difference. With binary without modsec everything is ok. Also, with the modsec nginx binary there is a memory allocation problems on restart and nginx does not want to start. Removed all limits on the operating systems including cgroups, ulimit, sysctl etc but no effect. If I back the binary without modsec everything is fine.

Is here anyone that faced such a problems and is there a chance to:

  1. The binary to be not compiled with the right options, by the right way or something other?
  2. The nginx to trigger all the modsec functions on restart even when mod security on is not present on the vhosts?
  3. I tried to optimize everything that I found but it's the same.

The machines are with 32GB of ram, 12 or 16 cpu cores and normal HDDs. On SSD it's a little bit better.

met3or commented 6 years ago

Hi intelbg,

I've so far been unable to replicate the issue that you're experiencing with my setup, I'm compiling the v3/master branch of ModSecurity (with a few tweaks to work-around SELinux but nothing major) and thrown an arbitrary 1000 vhosts at nginx, reloaded & restarted fine albeit slow.

The machine being used is physical, 32G RAM with 12 core cpu aswell as SSD for disk.

For you to reach the load averages you mention are you performing any specific tasks or are the vhosts particularly active?

intelbg commented 6 years ago

Thank you for your reply! I didn't perform any specific tasks - only yum update of the nginx and service nginx configtest or restart. The vhosts are particularly active, but with nginx without modsec there is not an issue. Maybe for every request by some way there is an overhead even if modsecurity is not on for the ghosts? Some calculations in the nginx binary?

met3or commented 6 years ago

Hi @intelbg ,

The domains I've added aren't in use so perhaps that is the reason for being unable to replicate, I will attempt to run some tests against these hosts and get back to you soon if I can replicate it or not.

intelbg commented 6 years ago

Thank you again. Meanwhile, is it possible the problem to come from different versions of gcc, g++, make, automate on the compile server and the production server to which transfer the binary later? Or in the parameters of the machine of the compilation server and the production servers (production servers are 10 times faster than this for the compilation)? But I compiled before that nginx in the same environment and without modsec no problems.

met3or commented 6 years ago

Hi @intelbg - Interestingly, my server load has not massively increased but it has exhausted it's memory and I'm getting a -

fork: Cannot allocate memory error from the system.

I will look at this a bit more now and see if I can see any obvious culprits.

intelbg commented 6 years ago

Did you found anything unexpected?

met3or commented 6 years ago

Hi @intelbg - I've not found the reason for this yet, but with further tests I have found an nginx out-of-memory event in the system messages. I will be trying to trace the cause of this, but from my preliminary checks It seems to be due to the large quantity of configs.

intelbg commented 6 years ago

But why this affects the system loads as these configs has not modsecurity enabled respectively they does not invoke modsecurity functions?

sethinsd commented 6 years ago

I am having this issue currently. we have about 1500 vhosts. Works fine as long as I don't send any big traffic at it, but with a moderate 20requests per second or so this is what the log ends up looking like the following (I've fixed the PCRE limits exceeded issue, but still having it crash out


2017/09/13 12:24:50 [alert] 1556#1556: worker process 3665 exited on signal 11 (core dumped)
2017/09/13 12:24:50 [alert] 1556#1556: worker process 3667 exited on signal 11 (core dumped)
2017/09/13 12:24:51 [alert] 1556#1556: worker process 3661 exited on signal 11 (core dumped)
2017/09/13 12:24:52 [alert] 1556#1556: worker process 3671 exited on signal 11 (core dumped)
2017/09/13 12:24:53 [alert] 1556#1556: worker process 3669 exited on signal 11 (core dumped)
2017/09/13 12:24:53 [alert] 1556#1556: worker process 3673 exited on signal 11 (core dumped)
2017/09/13 12:24:54 [alert] 1556#1556: worker process 3677 exited on signal 11 (core dumped)
2017/09/13 12:24:55 [alert] 1556#1556: worker process 3675 exited on signal 11 (core dumped)
2017/09/13 12:24:56 [alert] 1556#1556: worker process 3681 exited on signal 11 (core dumped)
2017/09/13 12:24:56 [alert] 1556#1556: worker process 3685 exited on signal 11 (core dumped)
met3or commented 6 years ago

Hi @sethinsd - For the error you're seeing here:

2017/09/13 12:24:48 [error] 3659#3659: [client 172.31.19.26] ModSecurity: Rule 7f4136252c70 [id "-"][file "/etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-951-DATA-LEAKAGES-SQL.conf"][line "272"] - Execution error - PCRE limits exceeded (-8): (null). [hostname "ip-172-31-8-158"] [uri "/"] [unique_id "AcA8AIAcAcANtcAcAcAc8cAc"]

You may need to configure the following settings:

SecPcreMatchLimit 500000
SecPcreMatchLimitRecursion 500000

To a value of your requirements, the reason this has a limit is to protect from DoS type attacks.

sethinsd commented 6 years ago

ah yes I fixed that one already. But the other issue is persisting with it crashing under load

intelbg commented 6 years ago

@sethinsd how did you fixed it? Can you please share all of your experience? I am not sure why until now nobody has reported these problems as they are serious and now I hope as this number of people is increasing to find a solution. If I can provide another information that can be useful just let me know.

sethinsd commented 6 years ago

I only fixed that one PCRE issue - the crashes and memory with fork problem persists and I cannot enable modsecurity in my production environment right now. I think it's likely related to both of us having so many host configuration files. I have about 1600 server {} declarations if not more.

sethinsd commented 6 years ago

One thing we are noticing in our setup even with modsec disabled is the active connections climb to unreal levels over a few days until we restart nginx non gracefully. Like upwards of 60k when really we only have 100-200 active connections. So something is seriously wrong with the nginx compile.

intelbg commented 6 years ago

Any progress found?

Tiki-God commented 6 years ago

I've been looking in to the same or very similar issue with @met3or and one of the other guys in the office and we have some thoughts on the cause of the memory usage but could do with someone more familiar with the code base commenting on this as we could be way off the mark.

We found here that every time a new vhost is parsed that this chunk of memory is malloc'd even if modsec isn't explicitly enabled for the domain. We've modified the source so that there's only a single shared instance of unicode_map_table. No idea yet if this is sensible but I'm trusting my colleague for now ;)

Got to do a whole bunch more testing with it to see what explodes, but happy to provide a diff if others want to test in their dev environments.

Tiki-God commented 6 years ago

Very rough and not very well tested patch attached.

memory.zip

sethinsd commented 6 years ago

What happened to the comment regarding :in /modsecurity/nginx/modsecurity/apr_bucket_nginx.c,at line 217 and 219 add cl->next = NULL;

I did this and recompiled nginx and gave it a bunch of load last night to test and it seemed to hold up without the issue that was crashing it previously. This might be a true fix. If you want us to try your patch in the above file, we can do so, but at the moment it looks like that change works for us. Granted, we only put this instance into our load balancer for 5-10mins or so, but it held up.

Tiki-God commented 6 years ago

Hi @sethinsd, maybe I should have created a new issue with my comment. I was trying to avoid creating unnecessary extra issues and this seemed relevant so apologies if I'm just confusing the matter.

The patch I attached is aimed at fixing the memory usage with a large number of domains configured. Specifically in the example we're testing in a dev environment. With 1000 domains configured, even with 'modsecurity off' set, the resident memory usage after starting Nginx is in the region of 1.8GB and seems to increase roughly linearly with the number of configured domains/server{} blocks. This also impacts the restart/reload/configtest times of Nginx. Our goal is to see what happens when Nginx is configured with 10x-20x that number of domains.

intelbg commented 6 years ago

@Tiki-God I work in real production environment (shared hosting env), so can you please confirm to me that this is the right step I should follow to apply your patch to test it and will be this applied to some of the branches so I re-pull the branch?

  1. cd ..../headers/modsecurity/
  2. patch rules.h < memory.patch
  3. Then recompile libmodsecurity and nginx connector

If there is an mistake in my steps please provide me the right one to test and confirm it works. Thank you in advance!

Tiki-God commented 6 years ago

@intelbg I would not put this in a live environment yet as it hasn't been tested properly. We're still testing in our dev environment to see if we can trigger any odd behaviour.

intelbg commented 6 years ago

@Tiki-God what about the steps? Are they the right one?

intelbg commented 6 years ago

[root@XXXXX modsecurity]# patch rules.h < memory.patch (Stripping trailing CRs from patch.) patching file rules.h (Stripping trailing CRs from patch.) patching file rules.h Hunk #1 FAILED at 31. Hunk #2 FAILED at 78. 2 out of 2 hunks FAILED -- saving rejects to file rules.h.rej [root@XXXX modsecurity]# pwd /usr/local/src/ModSecurity/headers/modsecurity

Tiki-God commented 6 years ago

@intelbg this is what I just did to check it's patching fine. After that I rebuild and install.

git clone https://github.com/SpiderLabs/ModSecurity.git cd ModSecurity/ git checkout v3/master patch -p1 < ~/memory.patch

sethinsd commented 6 years ago

Just to report back again - I tried giving a server our full load last night, and the problem I was experiencing returned. Although, everything seemed to be working ok, I would get these alerts too frequently to feel comfortable. Probably some clients out there were just dropping connection and retrying. 2017/09/27 05:33:38 [alert] 1415#1415: worker process 1643 exited on signal 11 (core dumped) 2017/09/27 05:33:38 [alert] 1415#1415: worker process 1645 exited on signal 11 (core dumped) 2017/09/27 05:33:39 [alert] 1415#1415: worker process 1647 exited on signal 11 (core dumped) 2017/09/27 05:33:40 [alert] 1415#1415: worker process 1641 exited on signal 11 (core dumped) 2017/09/27 05:33:41 [alert] 1415#1415: worker process 1649 exited on signal 11 (core dumped) 2017/09/27 05:33:41 [alert] 1415#1415: worker process 1651 exited on signal 11 (core dumped) 2017/09/27 05:33:41 [alert] 1415#1415: worker process 1654 exited on signal 11 (core dumped) 2017/09/27 05:33:42 [alert] 1415#1415: worker process 1657 exited on signal 11 (core dumped) 2017/09/27 05:33:45 [alert] 1415#1415: worker process 1655 exited on signal 11 (core dumped) 2017/09/27 05:33:53 [alert] 1415#1415: worker process 1659 exited on signal 11 (core dumped) 2017/09/27 05:33:54 [alert] 1415#1415: worker process 1661 exited on signal 11 (core dumped) 2017/09/27 05:33:54 [alert] 1415#1415: worker process 1664 exited on signal 11 (core dumped) 2017/09/27 05:33:56 [alert] 1415#1415: worker process 1668 exited on signal 11 (core dumped)

With ModSec off, this problem does not exist. And, this problem exists without any includes/rulesets turned on.

intelbg commented 6 years ago

@Tiki-God I receive the following errors when compiling libmodsecurity after applying your patch: I am using the nginx refactoring branch.

rules.cc:35:13: error: ‘int modsecurity::Rules::unicode_map_table’ is not a static member of ‘class modsecurity::Rules’ int Rules::unicode_map_table = unicode_map_table_data; ^ rules.cc:55:52: error: redefinition of ‘int modsecurity::unicode_map_table_data [262144]’ static int unicode_map_table_data[sizeof(int)65536] = {-1}; ^ rules.cc:34:12: error: ‘int modsecurity::unicode_map_table_data [262144]’ previously defined here static int unicode_map_table_data[sizeof(int)65536] = {-1}; ^ rules.cc:56:13: error: ‘int modsecurity::Rules::unicode_map_table’ is not a static member of ‘class modsecurity::Rules’ int Rules::unicode_map_table = unicode_map_table_data; ^ make[3]: *** [libmodsecurity_la-rules.lo] Er

Cloaked9000 commented 6 years ago

@intelbg It doesn't look like the patch has been applied to rules.h.

int *unicode_map_table; should be: static int *unicode_map_table;

And the contents of the constructors (the stuff between the curly braces) on lines 50 and 58 should be deleted.

zimmerle commented 6 years ago

Hi,

Lets make sure that we are all talking about the same code base here. libModSecurity is under the v3/master branch and demands the ModSecurity-nginx connector.

The _nginxrefactoring branch is not version 3. At _nginxrefactoring you can find some patches on top of v2.x but we don't recommend you to use it. Instead we suggest libModSecurity + ModSecurity-nginx project.

For instance, the file apr_bucket_nginx.c is part of v2.x only while the file rules.h is v3.x.

With that said, lets try to understand if the reported bug happens on v3, as we are suggesting the users to move forward it is likely that, if there is a bug, it won't be a fix for version 2.x

The error message: Execution error - PCRE limits exceede is part of version 2.x; it won't affect v3.

@Tiki-God the memory you pointed out maybe an issue, but is unlikely to affect this specific bug because you found an issue in v3, while the bugs reported are v2-related [at least given the error messages]. So, let me ask you guys to make sure that you are running the version 3, as pointed out here:

once we are all in v3, we can continue debug the issue, if any. How that sounds?

sethinsd commented 6 years ago

Ah that sounds great, apologies. I was compiling off 2.9.2. I'll pull down 3 and try soon and report back - my issue may not have been related to intelbg's because of these differences.

zimmerle commented 6 years ago

Thanks @sethinsd, looking forward for your feedback ;)

sethinsd commented 6 years ago

compiled with nginx (without patch provided above) and when trying to start nginx with modsecurity, it just failed to start. When I turned off the modsecurity on; directive, and tried to start nginx, system basically hung for quite a while. Eventually hit control c and enter, system didn't respond... after another min or two the control c finally came through, checked status and it was the same:


Sep 29 23:14:55 ip-10-3-3-136 systemd[1]: nginx.service: Control process exited, code=killed status=9
Sep 29 23:14:55 ip-10-3-3-136 systemd[1]: Failed to start The NGINX HTTP and reverse proxy server.
Sep 29 23:14:55 ip-10-3-3-136 systemd[1]: nginx.service: Unit entered failed state.
Sep 29 23:14:55 ip-10-3-3-136 systemd[1]: nginx.service: Failed with result 'signal'.```
sethinsd commented 6 years ago

patch also failed similarly for me, so edited the rules.h manually removing the malloc stuff between curly braces in constructors, verified that it was declared static int, recompiled and installed, tried running without modsecurity on; and it's still stalling. previously I was running the old 2.9 configured with --enable-standalone-module which worked but core dumped occasionally. with v3 I can't even get it to start without it enabled, probably due to the 1300+ domain conf files I have. Eventually after running sudo nginx and waiting a few minutes it just says Killed

zimmerle commented 6 years ago

Hi @sethinsd,

How is your ModSecurity configuration? Do you have a global ModSecurity configuration for all hosts or each one loads its own ModSecurity configuration?

intelbg commented 6 years ago

I am with modsec V3 and I would like to confirm that until now there isn't a problem with memory leaks on the heavy servers with modsec disabled for the vhosts. I upgraded one more server with the patch and if there is no problem on it too I will active some vhosts to check the situation there. Thank you about the support!

intelbg commented 6 years ago

I found the following thing after implementing the patch only on the servers with the mod security and the patch. On nginx reload it dies, but the pid file exists. This is the history:

activeshell@XXX1:~$ service nginx status nginx (pid 4343) is running... Reloading nginx: [ OK ] activeshell@dino:~$ service nginx status nginx dead but pid file exists

Server 2 with nginx and modsec:

activeshell@XXX2:~$ service nginx status nginx (pid 764) is running... activeshell@XXX2:~$ service nginx reload Reloading nginx: [ OK ] activeshell@XXX2:~$ service nginx status nginx dead but pid file exists

And this is the result where the patch and the mod security at all does not exists:

activeshell@working:~$ service nginx status nginx (pid 4130) is running... activeshell@working:~$ service nginx reload Reloading nginx: [ OK ] activeshell@working:~$ service nginx status nginx (pid 4130) is running...

Cloaked9000 commented 6 years ago

@intelbg Could you attach your rules.h and rules.cc file?

sethinsd commented 6 years ago

@zimmerle well, it was crashing even without modsecurity on; but I was attempting to put it in the http {} block, not any particular server or location.

intelbg commented 6 years ago

@Cloaked9000 I attach the two files:

rules.cc.txt rules.h.txt

zimmerle commented 6 years ago

@sethinsd, opened the issue #1579 so we can continue discussing about the crash while loading nginx when ModSecurity is compiled.

Cloaked9000 commented 6 years ago

@intelbg Yeah, I can see why it's crashing. The changes have been made to rules.h but not rules.cc fully.

The code in the rules.cc destructor should look like this:

Rules::~Rules() {

}

The patch statically defines unicode_map_table, instead of allocating it on the heap. So trying to free the memory as if it were on the heap is undefined behaviour. This is what causes Nginx to crash on reload/stop, as that's when this code is run.

On the topic of destructors, @sethinsd You might want to look at marking RulesProperties' destructor as virtual, because Rules inherits it. I've not looked at the code close enough to tell if this will make a practical difference, but I'd definitely make the change.

Edit: Class Driver in parser/driver.h also inherits RulesProperties and defines a destructor.

I'll attach my versions of rules.h and rules.cc, with the patch applied. Here: rules.zip

intelbg commented 6 years ago

@Cloaked9000 I removed the two lines in the block you provided and recompiled. Now it seems to work. Thank you really about the great support. If I find something other I will write here. Have a nice day!

AnoopAlias commented 6 years ago

I can confirm that this patch fixes SpiderLabs/ModSecurity-nginx#67 ( Mod_sec v3) . Tested this on a server with around 6000 domains and nginx -t and nginx -s reload completes just fine. So without merging this v3 is definitely unusable on a server with more than 20-30 domains. @Tiki-God - Looks like your patch file has some issue merging the changes. the full replacement of rules.h and rules.cc provided by @Cloaked9000 worked just fine.

Cloaked9000 commented 6 years ago

@intelbg @AnoopAlias Just a heads up, neither me nor @Tiki-God are affiliated with Mod Security, we just had the same issue. If you're going to test the patch, do so with unicode URLs

zimmerle commented 6 years ago

Hi,

The initially investigation that pointed the unicode malloc as the guilt for the mass memory consumption was indeed true. The proposed solution, however, may not be the best approach.

Every single location or { } block inside nginx uses a Rules instance, regardless having ModSecurity rules or not. That is normal and expected, however, allocating a huge block on that structure is doomed to failed. Keeping the Rules object light is the key for the success of that approach. Instead of removing the unicode support, the patch on the main tree suggest the utilization of a structure which is only occupies the memory when the configuration flag is used. In practice, for those whom are not using the unicode map, is the same as remove it.

Changes are available here: 1ad95254cd8caec4a0af83d01844fc3cc65f3ce7

AnoopAlias commented 6 years ago

Just tested this using the official patch in v3/Master and things look ok. nginx reload and restart works fine