thegreenwebfoundation / greencheck-api

The green web foundation API
https://www.thegreenwebfoundation.org/
Apache License 2.0

Find way to support runaway memory usage with workers #23

Closed: mrchrisadams closed this issue 5 years ago

mrchrisadams commented 5 years ago

We had an incident today where workers consuming from the RabbitMQ queue suffered runaway memory usage, eating so much memory in production that they froze the whole box.

We have a few options for catching runaway memory usage to avoid this, but given that we're using supervisord to maintain a pool of workers, it's worth looking at superlance, an extension to supervisord that tracks memory usage, to automatically catch processes that are using too much memory.

You can see some more guidance on setting it up and installing it here, but generally speaking, the approach is:

  1. install with pip install superlance

  2. add a stanza like the one below to the supervisor config file at /etc/supervisor/conf.d/enqueue_greencheck.conf

[eventlistener:memmon]
command=memmon -p <program_name>=3GB
events=TICK_60
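
To get supervisord to pick the new stanza up, something like this should do it (a rough sketch, assuming supervisorctl is on the box and pip installs into the same environment supervisord runs from):

# install the memmon event listener
pip install superlance

# ask supervisord to re-read its config and apply the new stanza
supervisorctl reread
supervisorctl update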

We probably need to do this for a group rather than a single process, as we have a pool of workers that we care about.
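
memmon has a -g flag for supervisor groups, so a group-level stanza might look something like this sketch (enqueue_greencheck is a placeholder; it needs to match the actual group name in our supervisor config):

[eventlistener:memmon]
; -g applies the 3GB limit to each process in the named supervisor group
command=memmon -g enqueue_greencheck=3GB
events=TICK_60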

More here:

https://thepracticalsysadmin.com/quicktip-manage-memory-usage-with-supervisord/

https://github.com/corvus-ch/rabbitmq-cli-consumer

mrchrisadams commented 5 years ago

This turned out to be okay to set up.

We now have a listener: every 60 seconds (that's the TICK_60 bit), if memory consumption in an enqgreencheck worker exceeds 500MB, we kill it.

[eventlistener:memmon]
command=memmon -p enqgreencheck=500MB -m support-address@streams.zulipchat.com
events=TICK_60

In more detail:

The name of the stanza, as per supervisord:

[eventlistener:memmon]

This is the name of the command to call:

command=memmon

The memory threshold we check for:

  -p enqgreencheck=500MB

The address we email a notification to when it happens, in our case our zulip chat room:

  -m support-address@streams.zulipchat.com

Do this check every 60 seconds:

events=TICK_60
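
As a quick sanity check, the listener shows up in supervisorctl like any other managed process, and since event listeners use stdout for the event protocol, memmon's own logging goes to stderr:

# confirm the listener is registered and running
supervisorctl status memmon

# follow memmon's output to watch the checks happening
supervisorctl tail -f memmon stderr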