add alert when fetcher-in queue size is too big

rahulbot commented 6 months ago

We recently saw a backlog of URLs waiting to be fetched (visible in the RabbitMQ fetcher-in queue-size stat). A backlog like this is a cause for concern: it may mean the system is misbehaving, or be a symptom of some other problem. Based on a quick scan of the last 7 days, I suggest we add an alert condition that sends us a message if that stat goes above 50,000.

[Image: Grafana screenshot, Media Cloud story-indexer (all realms)]

(updated with better graph)

philbudne commented 6 months ago

There was already an alert that fires if the 5-minute running average of the total of all -in queues is above 200,000; I had picked 200K to be conservative. I've lowered that to 50K, which will catch backups at any point in the pipeline. Note: the alerts are on a separate dashboard named "story-indexer alerts".
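For discussion, here is a rough Python sketch of the condition described above: sum the depths of all -in queues, keep a 5-minute running average, and flag when it exceeds 50K. It is not the Grafana alert definition itself. The class name, the sample format, and the "parser-in" / "parser-delay" queue names are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Tuple

WINDOW_SECONDS = 5 * 60   # 5-minute running average, as in the alert
THRESHOLD = 50_000        # the lowered threshold (was 200,000)


def total_in_depth(depths: Dict[str, int]) -> int:
    """Sum the depths of every queue whose name ends in "-in"."""
    return sum(n for name, n in depths.items() if name.endswith("-in"))


class RunningAverageAlert:
    """Hypothetical helper: tracks samples and reports whether to alert."""

    def __init__(self, window: float = WINDOW_SECONDS,
                 threshold: float = THRESHOLD) -> None:
        self.window = window
        self.threshold = threshold
        self.samples: Deque[Tuple[float, int]] = deque()

    def add(self, timestamp: float, depths: Dict[str, int]) -> bool:
        """Record one sample; return True if the running average is too high."""
        self.samples.append((timestamp, total_in_depth(depths)))
        # Drop samples that have fallen out of the averaging window.
        while self.samples[0][0] <= timestamp - self.window:
            self.samples.popleft()
        average = sum(v for _, v in self.samples) / len(self.samples)
        return average > self.threshold
```

Because the total is taken over every -in queue, a backup in any pipeline step raises the average, not only one in fetcher-in.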

Click on the bell icon on the left edge to see a list of all the defined alerts.

NOTE! Any changes should be exported as JSON and checked in to story-indexer/conf/grafana/story-indexer-alerts.json

Also summing in the -delay ("retry") queues, or adding a separate alert on the sum of all -delay queues, might help make it visible when some pipeline step is failing (e.g., a library routine used by "parser" throws an unhandled exception) and quickly chews through its -in queue. In that case the -in total can stay low, so the current alert would presumably not fire.
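To make the -delay idea concrete, here is a small Python sketch of a separate check on the summed -delay queues alongside the -in check. The 10,000 delay threshold, the function names, and the "parser-in" / "parser-delay" queue names are placeholders rather than values from the actual configuration.

```python
from typing import Dict, List

IN_THRESHOLD = 50_000      # current threshold for the sum of -in queues
DELAY_THRESHOLD = 10_000   # hypothetical value; would need tuning


def sum_by_suffix(depths: Dict[str, int], suffix: str) -> int:
    """Total depth of all queues whose name ends with the given suffix."""
    return sum(n for name, n in depths.items() if name.endswith(suffix))


def check_queues(depths: Dict[str, int]) -> List[str]:
    """Return the names of the (hypothetical) alerts that would fire."""
    alerts = []
    if sum_by_suffix(depths, "-in") > IN_THRESHOLD:
        alerts.append("in-backlog")
    if sum_by_suffix(depths, "-delay") > DELAY_THRESHOLD:
        alerts.append("delay-backlog")
    return alerts


# A failing step drains its -in queue into -delay: only the delay check fires.
print(check_queues({"parser-in": 20, "parser-delay": 40_000}))
```

Keeping the two sums separate, rather than folding -delay into the existing total, would let the message say whether the problem is a backlog or a step that keeps failing and retrying.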

Leaving the issue open for discussion (if any).

rahulbot commented 6 months ago

Closing as good. Thanks.

mediacloud / story-indexer

add alert when fetcher-in queue size is too big #286