ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

FIREBREAK: Investigate ChangeInNodeCount alert from Cloudwatch #5038

Open tom-j-smith opened 10 months ago

tom-j-smith commented 10 months ago

Background

We recently split the ChangeInNodeCount alert in Prometheus into separate increase and decrease alerts. This works around a Prometheus/Alertmanager limitation and gets more messages into the low-priority-alarms channel when the alerts fire. There are still a few issues with using Prometheus for this alert. One of the main ones is that if an alert is already firing due to an instance scaling up, and more nodes scale up before the alert has reset, the firing status is simply extended and no new alerts are sent. This could be a problem if something gradually increases the cluster's node count over a period of time: the alert may trigger and send a message saying that one or two nodes have spun up, but as more nodes are spun up the alarm stays in firing and no further messages are sent.
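To make the gap concrete, here is a minimal sketch in Python that compares the two notification styles on the same series of node counts. It is a simplified model, not how Prometheus or Alertmanager is implemented: the function names, the `reset_after` parameter and the sample node counts are all assumptions made for illustration.

```python
from typing import List


def state_based_notifications(node_counts: List[int], reset_after: int = 3) -> int:
    """Count notifications when one is sent only on the inactive -> firing transition.

    Simplified model (assumption): the alert stays firing until `reset_after`
    consecutive samples pass without an increase.
    """
    notifications = 0
    firing = False
    quiet_samples = 0
    for prev, cur in zip(node_counts, node_counts[1:]):
        if cur > prev:
            if not firing:
                # Only the first increase in a firing window sends a message.
                notifications += 1
                firing = True
            # Any further increase just extends the firing window.
            quiet_samples = 0
        elif firing:
            quiet_samples += 1
            if quiet_samples >= reset_after:
                firing = False
    return notifications


def event_based_notifications(node_counts: List[int]) -> int:
    """Count notifications when every individual increase sends a message."""
    return sum(1 for prev, cur in zip(node_counts, node_counts[1:]) if cur > prev)


# Hypothetical gradual scale-up: three separate increases, each arriving
# before the alert has had time to reset.
gradual_scale_up = [10, 11, 11, 12, 12, 13, 13, 13, 13, 13]
print(state_based_notifications(gradual_scale_up))  # 1
print(event_based_notifications(gradual_scale_up))  # 3
```

In this model the state-based alert reports only the first increase, while an event-based source reports all three, which is the behaviour this ticket is looking for.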

This firebreak ticket will be used to investigate whether we can create a change in node count alert from a different source, e.g. AWS CloudWatch, that sends a notification every time the cluster's node count changes, rather than waiting for an alert to fire and then sending a message.

Proposed user journey

Approach

Something similar to this AWS article

Which part of the user docs does this impact

Communicate changes

Questions / Assumptions

Definition of done

Reference

How to write good user stories

tmahmood72 commented 1 month ago

- Created change in node count alert (using click-ops)
- Created SNS topic
- Subscribed to SNS topic
- Configured Auto Scaling group to send notifications
- Tested notification by increasing the 'Desired capacity'
- Slack message received in lower-priority-alarms channel
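The comment above does not say how the SNS message reaches Slack. If a small function (for example an AWS Lambda subscribed to the topic) sits in between, its parsing step might look roughly like the sketch below. The function name `format_scaling_messages`, the message wording and the sample group and instance names are hypothetical. The `Records[].Sns.Message` envelope and the `Event`, `AutoScalingGroupName` and `EC2InstanceId` fields follow the documented format of Auto Scaling notifications delivered through SNS, but should be checked against a real message before relying on them.

```python
import json
from typing import Any, Dict, List

# Event types published by EC2 Auto Scaling for successful scale out / scale in.
LAUNCH = "autoscaling:EC2_INSTANCE_LAUNCH"
TERMINATE = "autoscaling:EC2_INSTANCE_TERMINATE"


def format_scaling_messages(event: Dict[str, Any]) -> List[str]:
    """Turn an SNS-delivered Auto Scaling event into one text line per node change."""
    messages: List[str] = []
    for record in event.get("Records", []):
        # SNS delivers the Auto Scaling notification as a JSON string.
        body = json.loads(record["Sns"]["Message"])
        kind = body.get("Event")
        if kind == LAUNCH:
            verb = "launched"
        elif kind == TERMINATE:
            verb = "terminated"
        else:
            # Skip anything else, e.g. test notifications or error events.
            continue
        group = body.get("AutoScalingGroupName", "unknown group")
        instance = body.get("EC2InstanceId", "unknown instance")
        messages.append(f"Node {verb} in {group}: {instance}")
    return messages


# Hypothetical sample event; the group and instance names are made up.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "Event": LAUNCH,
                        "AutoScalingGroupName": "example-node-group",
                        "EC2InstanceId": "i-0123456789abcdef0",
                    }
                )
            }
        }
    ]
}
print(format_scaling_messages(sample_event))
```

Because each launch or termination produces its own SNS message, every node change yields its own line, rather than one message per firing window.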