Closed by rahulbot 6 months ago
There was already an alert that fires when the 5-minute running average of the total across all -in queues is above 200,000; I had picked 200K to be conservative. I've lowered the threshold to 50K. Because it sums every -in queue, it will catch backups at any point in the pipeline. Note: the alerts are on a separate dashboard named "story-indexer alerts".
Click on the bell icon on the left edge to see a list of all the defined alerts.
NOTE! Any changes should be exported as JSON and checked in to story-indexer/conf/grafana/story-indexer-alerts.json
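For discussion, the alert condition described above (5-minute running average of the summed -in queues, compared against 50K) can be sketched in plain Python. This is only an illustration of the logic, not the Grafana rule itself: the class name, the `observe` method, and the `parser-in` queue name are assumptions made up for the example; only `fetcher-in` and the thresholds come from this thread.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60  # 5-minute running average, as in the alert
THRESHOLD = 50_000       # lowered from the original 200_000


class RunningAverageAlert:
    """Hypothetical helper: tracks the summed depth of all -in queues."""

    def __init__(self, window_seconds=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.samples = deque()  # (timestamp, total -in depth) pairs

    def observe(self, timestamp, queue_depths):
        """Record one sample; return True if the windowed average is too high."""
        # Sum only the queues whose name ends in "-in".
        total = sum(
            depth for name, depth in queue_depths.items() if name.endswith("-in")
        )
        self.samples.append((timestamp, total))
        # Drop samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples[0][0] <= cutoff:
            self.samples.popleft()
        average = sum(t for _, t in self.samples) / len(self.samples)
        return average > self.threshold


alert = RunningAverageAlert()
# Timestamps are seconds; the queue names other than fetcher-in are illustrative.
first = alert.observe(0, {"fetcher-in": 10_000, "parser-in": 5_000})
second = alert.observe(60, {"fetcher-in": 200_000, "parser-in": 0})
```

Averaging over the window means a single brief spike does not page anyone unless it is large enough to pull the whole 5-minute average over the threshold.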
Summing in the -delay queues (or adding a separate alert on the sum of all -delay ("retry") queues) might help make it visible when some pipeline step is failing (e.g., some library routine used by "parser" throws an unhandled exception) in a way that quickly chews through the -in queue.
Leaving the issue open for discussion (if any).
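To make the -delay idea concrete, here is a rough Python sketch of why a separate sum over the retry queues would catch a failure that the -in alert misses. The function names, the `parser-in` and `parser-delay` queue names, and the 1,000 threshold are all hypothetical; no -delay threshold has been proposed in this thread.

```python
def sum_by_suffix(queue_depths, suffix):
    """Total depth of every queue whose name ends with the given suffix."""
    return sum(
        depth for name, depth in queue_depths.items() if name.endswith(suffix)
    )


def delay_alert(queue_depths, threshold):
    """True when the combined -delay ("retry") queues exceed the threshold."""
    return sum_by_suffix(queue_depths, "-delay") > threshold


# A failing step drains its -in queue quickly and pushes messages to -delay,
# so the -in total looks healthy while retries pile up.
depths = {"fetcher-in": 0, "parser-in": 0, "parser-delay": 12_000}
in_total = sum_by_suffix(depths, "-in")
retry_backlog = delay_alert(depths, threshold=1_000)  # hypothetical threshold
```

In this made-up snapshot the -in total is zero, so the existing alert stays quiet, while the -delay check flags the backlog.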
Closing as good. Thanks.
We recently saw a backlog of URLs waiting to be fetched (via the RabbitMQ fetcher-in stat). This is a cause for concern: it might indicate system misbehavior or point to some other problem. Based on a quick scan of the last 7 days, I suggest we add an alert condition to send us a message if that stat goes above 50,000. (Updated with better graph.)