lobsters / lobsters-ansible

Ansible playbook for lobste.rs
ISC License

mariadb was oomkilled #34

Closed: pushcx closed this issue 5 years ago

pushcx commented 6 years ago

In #lobsters eeeeeta suggested using OOMScoreAdjust to prevent mysql from being oomkilled again.

Probably we want to turn that score down for mysql, nginx, and/or the master unicorn process, and/or up for unicorn workers. I don't know how much to change these numbers by or where it would be most effective with fewest side effects. Anyone familiar?
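For reference, eeeeeta's suggestion would look something like the drop-in below. The directive is real systemd (`OOMScoreAdjust=` in `systemd.exec(5)`, which sets `/proc/<pid>/oom_score_adj`, range -1000 to +1000), but the -600 value is a placeholder, not a tested recommendation:

```ini
# /etc/systemd/system/mariadb.service.d/oom.conf
# -1000 means "never OOM-kill this process"; +1000 means "kill this first".
# A moderately negative value makes mariadb an unattractive victim without
# fully exempting it.
[Service]
OOMScoreAdjust=-600
```

A positive value in the unicorn worker unit would do the opposite and steer the OOM killer toward the workers. Apply with `systemctl daemon-reload` and a restart of the service.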

jstoja commented 6 years ago

I personally wouldn't touch the OOM score, since the OOM killer might then go after system processes instead (sshd, say, or the DNS client), making it very annoying to restore anything.

I would rather suggest using the built-in resource control from systemd (which uses cgroups under the hood). It should be easy enough to implement, and I'd consider it better practice than adjusting the score.
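As a minimal sketch of what that could look like (the 4G cap is made up for illustration; the directives themselves are standard `systemd.resource-control(5)` settings):

```ini
# /etc/systemd/system/mariadb.service.d/memory.conf
# systemd translates these directives into cgroup memory-controller limits
# scoped to this unit only, so other services are unaffected.
[Service]
MemoryAccounting=yes
MemoryMax=4G
```

This confines the pressure to the unit that misbehaves, rather than globally reshuffling which process the kernel kills first.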

What do you think about it?

pushcx commented 5 years ago

One of the nice things about prgmr is that we have a serial console we can connect through if ssh or other public networking breaks.

Could you explain more of what you were thinking with this resource control? If it's setting a max for RAM, how will that play out? Will mariadb see a malloc fail? Does it respond by evicting some old cache or printing an informative error message to the logs and exiting?

jstoja commented 5 years ago

So there are three main settings for memory control: low, high, and max.

Setting these three values for your main processes lets you "guarantee" them memory. MariaDB would be protected from the OOM killer because it stays under its hard limit. Conversely, if Unicorn grabs way too much memory, it hits the high limit, which makes it harder for it to reach the max; and if it does reach max, it's probably better to restart Unicorn than to kill some other process (which might be important) while still letting Unicorn expand further.
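In systemd terms those three knobs are `MemoryLow=`, `MemoryHigh=`, and `MemoryMax=` (cgroup v2). A hypothetical split for this box might look like the drop-ins below; the sizes are placeholders, not tuned values. Note that hitting `MemoryMax=` does not surface as a failed malloc: the kernel OOM-kills inside that unit's cgroup, and systemd can then restart the service per its `Restart=` policy:

```ini
# --- file: /etc/systemd/system/mariadb.service.d/memory.conf ---
# Guarantee the database its working set.
[Service]
MemoryAccounting=yes
MemoryLow=1G    # below this, mariadb's memory is protected from reclaim
MemoryHigh=3G   # above this, the kernel reclaims from mariadb aggressively
MemoryMax=4G    # hard cap: exceeding it OOM-kills within this unit only
```

```ini
# --- file: /etc/systemd/system/unicorn.service.d/memory.conf ---
# Let the app workers feel pressure first.
[Service]
MemoryAccounting=yes
MemoryHigh=1G
MemoryMax=1500M
```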

pushcx commented 5 years ago

This is probably fixed now, see comment in linked PR. Leaving this open a month or two to confirm.

pushcx commented 5 years ago

Been more than a month, zero issues. Thanks to Scout on this one.