turnkeylinux / tracker

TurnKey Linux Tracker
https://www.turnkeylinux.org

Leverage Monit to restart Apache when MySQL is killed by OOMKiller (was Need OOM Killer Protection) #276

Open OnePressTech opened 9 years ago

OnePressTech commented 9 years ago

[update by @jedmeister] I have renamed this issue to reflect the discussed path to resolve the problem. Resolution of this issue would be a component of #542, although should be something of a priority in and of itself IMO.


At present, under heavy load the Apache server in a LAMP appliance consumes all available memory and the OOM killer kills processes to free memory up. The process that usually gets killed is the database (the OOM Killer selects victims by size and priority).

While the Apache server is configured to restart itself if it is killed, for some reason the database is not configured to restart itself.

Suggestion (would be nice to have in the upcoming 13.1 / 14 release time permitting):

1) Configure Apache settings automatically via a calculation based on the average Apache child size and a capped percentage of physical memory, taking into account the size of the database (as the database grows, the memory available to Apache for concurrent requests shrinks).

2) Set the database to restart itself when it dies (at least one retry).

3) Give the database a high OOM priority so it is the last thing killed in a low-memory situation (i.e. we could first reduce Apache's memory allocation, and could set the OOM priority of webmin, phpMyAdmin, and webshell lower than the database's so they are killed before it; a cron job could then restart them once memory settles back to a steady state).
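A minimal sketch of suggestion 3, assuming the process names below and illustrative score values (`oom_score_adj` ranges from -1000, never kill, to +1000, kill first; these are not TurnKey defaults):

```shell
#!/bin/sh
# Sketch: protect mysqld from the OOM killer and make auxiliary services
# (e.g. webmin's miniserv) more expendable. Process names and score
# values here are assumptions, not TurnKey defaults.
set -eu
PROC=${PROC:-/proc}    # overridable so the helper can be tested

set_oom_adj() {
    # $1 = pid, $2 = oom_score_adj value (-1000 .. 1000)
    printf '%s\n' "$2" > "$PROC/$1/oom_score_adj"
}

adjust_oom() {
    # $1 = exact process name, $2 = score; no-op if the process isn't running
    for pid in $(pgrep -x "$1" || true); do
        set_oom_adj "$pid" "$2"
    done
}

# Apply only when invoked with --apply (negative scores require root)
if [ "${1:-}" = "--apply" ]; then
    adjust_oom mysqld   -500   # database: last to be killed
    adjust_oom miniserv  300   # webmin: killed first; cron can restart it
fi
```

Note that `oom_score_adj` is per-process and is not inherited across service restarts, so something (an init script hook or a cron job) would need to reapply it.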


JedMeister commented 8 years ago

As of v14.0 we added Monit; so perhaps that could be leveraged to provide the MySQL restart?!

I'm going to peg this to v14.1 for now and hopefully we can include something. v14.1 dev is starting now(ish)...

OnePressTech commented 8 years ago

The trick with a MySQL restart is that it is intertwined with Apache's memory hogging. We can't just do a reset when MySQL is down, because the underlying reason it was killed by the OOM Killer was lack of memory due to Apache hogging it. Apache holds its memory, so if we just restart MySQL the OOM Killer will kill it again on the next web server access; since this situation usually emerges under extreme load, that likely means immediately. So the simplest reset is actually a bit more involved than a simple "restart MySQL when down" trigger. The watchdog / event script will need to check whether Apache is hogging memory and, if so, restart Apache, then wait for free memory to climb back above the OOM Killer threshold, then restart MySQL.
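That recovery sequence might be sketched roughly as follows (the service names, the ~256 MB threshold, and the 60-second wait cap are all assumptions):

```shell
#!/bin/sh
# Sketch of the watchdog described above: when MySQL is down, free RAM by
# restarting Apache first, wait for the kernel to reclaim memory, then
# bring MySQL back. Intended to be run periodically (e.g. from cron).
set -eu
MIN_FREE_KB=${MIN_FREE_KB:-262144}   # ~256 MB headroom before restarting

free_kb() {
    awk '/^MemAvailable:/ {print $2}' /proc/meminfo
}

mysql_down() {
    ! pgrep -x mysqld >/dev/null 2>&1
}

recover_mysql() {
    service apache2 restart          # release Apache's hoarded memory
    for _ in $(seq 1 30); do         # wait up to ~60s for RAM to free up
        [ "$(free_kb)" -ge "$MIN_FREE_KB" ] && break
        sleep 2
    done
    service mysql start
}

# Guarded so sourcing/testing the functions has no side effects
if [ "${1:-}" = "--run" ] && mysql_down; then
    recover_mysql
fi
```

A production version would also want logging and rate-limiting so it can't itself get stuck in a restart loop.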

The alternative is to set the OOM Killer priorities so that MySQL is a higher priority than Apache, so the OOM Killer kills Apache and leaves MySQL alone. Apache will then auto-restart with a low memory footprint.

Just restarting MySQL without knowing why it was down risks a MySQL restart loop which could render the VM inaccessible.

JedMeister commented 8 years ago

Great info, thanks Tim. I understand it better now.

The only downside of that is that I'm now not as sure of the best path to provide this sort of protection. Regardless I'll leave it pinned to v14.1 and we'll see how we go. If you have any specific suggestions then (as per always) I'd love to hear them! :smile:

OnePressTech commented 8 years ago

I'm about to tackle a TKLDev build over the Christmas / New Years period and this is one of a few issues I was planning on tackling. I'll keep you posted.

JedMeister commented 8 years ago

Nice! :+1: PS if I don't speak to you beforehand, merry Christmas :christmas_tree: :santa: and happy new year!

JedMeister commented 8 years ago

@OnePressTech - I spoke with Alon about this last night. Essentially he suggested that as a general rule, the reason why the OOM Killer was acting is that the server has insufficient resources. If that's not the cause then there is a bug (probably a memory leak) somewhere else.

On reflection I thought that was actually a really good point.

Alon thought that documenting how to enable some sort of OOM Killer protection is a good idea (for those trying to minimise their hosting costs who might be right on the edge of their server's capacity) but was generally against including it by default. He was definitely against having it enabled by default (even if we were to include it).

What say you?

OnePressTech commented 8 years ago

On a server with Apache fronting PHP it is not an issue of insufficient resources; it is an issue of an imbalance of resources. If you constrain Apache too much you end up with under-utilised RAM and CPU. If you tune it up it will work fine for a time, but if the PHP process size increases the balance becomes an imbalance and the OOM killer terminates the largest process, which is usually the database. I don't have experience with Drupal PHP process size growth, but WordPress PHP processes grow as you add more plug-ins. So it's a ticking time bomb waiting to go off. It should be addressed, I think.

With TKLX targeted to a DIY crowd I would think single server configurations would be high in number. I know I deploy single server solutions for most of my clients. The cost / value of a multi-server replicating architecture is too high for an SMB price-point as is the price doubling in moving up a VM size.

Alon's caution is reasonable. I would suggest a few options for consideration:

1) Park the auto-resource balance for now and focus on giving the database a higher OOM priority so that it is not killed when web server resources get low. That is a benign modification.

2) If option 1 is still considered a risk, consider only installing this adjustment on appliances with an Apache web server for now.

For the record I will be working on the auto-resource balance algorithm at some point in the near future. TKLX can re-evaluate in the 15.0 release once it has been shaken out in the field.
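As a rough illustration of what such an auto-balance calculation might look like (the 512 MB reservation, the 80% cap, and the 50 MB fallback child size are all made-up numbers, not part of any TurnKey algorithm):

```shell
#!/bin/sh
# Sketch: derive an Apache MaxRequestWorkers (MaxClients pre-2.4) value
# from physical RAM, a reservation for MySQL, and the average resident
# size of Apache children. All constants are illustrative.

# workers = (total - reserved) * cap% / average child size (all in kB)
max_workers() {
    total_kb=$1; reserved_kb=$2; avg_child_kb=$3; cap_pct=$4
    echo $(( (total_kb - reserved_kb) * cap_pct / 100 / avg_child_kb ))
}

total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# Average RSS of running Apache children; fall back to 50 MB if none found
avg_kb=$(ps -o rss= -C apache2 2>/dev/null | awk '{s+=$1; n++} END {print (n ? int(s/n) : 51200)}')
echo "MaxRequestWorkers $(max_workers "$total_kb" 524288 "$avg_kb" 80)"
```

A real implementation would need to re-run periodically (child size drifts as plug-ins are added) and clamp the result to sane bounds.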

JedMeister commented 8 years ago

I have done a fair bit of reading on this now and IMO I have a much better understanding.

I think some sort of Apache tuning solution is potentially a nice one (although may only be part of the solution). Also some sort of auto-resource balance algorithm could be a good thing too.

Although I actually think that the specific cause of the issue may come from configuration and usage factors that cause the OOM adjustment score to be sub-optimal for MySQL on a TurnKey appliance. Perhaps TKLBAM is a factor in this? From what I understand TKLBAM does a SQL dump of each DB and by default MySQL stores the whole dump in RAM until it is written to disk (thus massively increasing MySQL RAM usage).

If I am understanding correctly, I would actually expect under normal conditions that Apache would have a higher "badness" score than MySQL. And as OOM killer should kill child processes before parents (for forked processes), having the OOM killer killing Apache (child processes) would be much preferable to killing MySQL (which is threaded, thus killing the one and only MySQL process...)

It also appears that by the time we get to MySQL 5.7 we will need to implement a proper solution for this. v5.7 supports systemd auto-restarts (see here), so under current circumstances we could end up in a situation where the MySQL process keeps crashing and restarting....
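For reference, systemd's restart behaviour is controlled by directives like the following; this is a hypothetical drop-in (unit name, path, and limits are assumptions) that enables auto-restart while capping it so an OOM-kill loop gives up rather than thrashing:

```ini
# /etc/systemd/system/mysql.service.d/restart.conf  (hypothetical drop-in)
[Unit]
# Give up if the service fails 3 times within 5 minutes
StartLimitIntervalSec=300
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5
```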

Also compounding that is that AFAIK as of v5.6 (with default settings) the minimum RAM requirements for MySQL ramp up to 1GB - although with performance gains to match (sorry no reference for this; don't recall where I read it). For 5.5 I have read that with default settings, minimum is 512 MB although I haven't found that first hand. As a side note I found a mysql-calculator that suggests that using defaults results in 576.2 MB usage. I have no idea what version that applies to though (and/or whether it's the default settings that significantly increase RAM usage in newer versions or other architectural design factors).

Anyway, I digress...

From my reading, ensuring that MySQL has a lower OOM score (and/or that Apache has a higher one) would be a good start for this. My research concurs with your assertion that this is a benign mod. I just have to convince Alon that it's actually an issue that we should address...

TBH I'm also unclear on the best way to do this. As these things are calculated by the kernel on the fly, the only way that I can imagine this would work is via a cron script. IMO it would need to check and adjust the OOM adjustment factor; although it's possible that I'm missing something...

Off on a tangent: I was amazed by a suggestion from Oracle that the OOM killer could be configured to cause a kernel panic and induce a reboot when processes are killed! The post did acknowledge that it wasn't particularly elegant, but I found it a little odd to even suggest...

Also re MySQL; perhaps there may be some value in using mysqltuner? I wonder if we could leverage that somehow to make it easier for newbs to tune their DBs?

OnePressTech commented 8 years ago

You're getting the hang of it now :-)

JedMeister commented 8 years ago

I've spoken with Alon about this some more and I'm still not really getting any traction. He essentially argued that there is a bug somewhere and/or this is a system resource issue. In some respects he's right, but IMO providing a workaround (as per our discussions) would still be useful.

Alon suggested that a better long term approach to resolve the issue would be to swap out Apache for Nginx (i.e. base most apps on the nginx-php-fastcgi aka LEMP appliance rather than LAMP). Perhaps he's got a point there?

Regardless I'm going to have to retarget this issue to v14.2. Sorry Tim, best I can do for now...

OnePressTech commented 8 years ago

Thanks Jed. Please pass on to Alon that this is not a bug situation. It is a design issue: what to do when you run out of RAM. I'm not sure why the suggestion to adjust the default OOM Killer priority of the MySQL database is proving such a challenge to get on board with.

Switching to Nginx does not change the issue, and I'm not sure why you think it would. Nginx vs. Apache is only relevant for static content, not dynamic content; the same OOM-killer issue applies.

JedMeister commented 7 years ago

This may actually be better managed by Monit? I suspect that something like restarting Apache and MySQL if MySQL crashes would be pretty good at resolving this specific issue. If that seems reasonable, then it should be considered part of #542
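A minimal monit sketch of that idea (the pidfile path, service names, and the 10-second pause are assumptions; TurnKey's actual Monit layout may differ):

```
# /etc/monit/conf.d/mysql  (hypothetical)
check process mysql with pidfile /var/run/mysqld/mysqld.pid
    # When MySQL disappears (e.g. OOM-killed), restart Apache first so its
    # hoarded memory is released, then start MySQL into the freed RAM.
    start program = "/bin/sh -c 'service apache2 restart; sleep 10; service mysql start'"
    stop program  = "/usr/sbin/service mysql stop"
    if does not exist then restart
```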

JedMeister commented 6 years ago

I was tempted to move this to v15.1 milestone as again we're running way behind schedule and need to focus on the high priority items which don't have too much risk of causing issues.

However, hopefully this will be somewhat mitigated by the implementation of #1015 although configuring monit to deal with it is also a really good idea (on top of using php-fpm).

So for now, I'm leaving it pinned to v15.0.

JedMeister commented 6 years ago

As #1015 got moved to v15.1 and this (again) hasn't been addressed in v15.0, I'm going to move this to v15.1.

IMO we should make #542 (improving the Monit integration) something of a priority for v15.1! At the very least to resolve this particular issue.

As I may have already noted, Alon is not at all keen to actually adjust the OOMKiller itself, but using Monit to manage this is a good compromise IMO.

Actually, I'm also going to rename this issue and add a note to the OP re using Monit.