Open tom-j-smith opened 10 months ago
Created change in node count alert (using click-ops):
- created SNS topic
- subscribed to SNS topic
- configured auto scaling group to send notifications
- tested notification by increasing the 'Desired capacity'
- slack message received in lower-priority-alarms channel
Background
We recently split the ChangeInNodeCount alert in Prometheus into separate increase and decrease alerts to work around a Prometheus/Alertmanager limitation and get more messages into the low-priority-alarms channel when the alerts fire. There are still a few issues with using Prometheus for this alert. One of the main ones is that if an alert is already firing because an instance has scaled up, and more nodes scale up before the alert has reset, the firing status is simply extended and no new alert is sent. This could be a problem if something gradually increases the cluster's node count over a period of time: the alert may trigger and send a message saying that one or two nodes have spun up, but as more nodes are spun up the alert stays in firing and no further messages are sent.
This firebreak ticket would be used to investigate whether we can create a change in node count alert from a different source, e.g. AWS CloudWatch, that sends a notification every time the cluster scales, rather than waiting for an alert to fire and then sending a message.
Proposed user journey
Approach
Something similar to this AWS article
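For reference, the click-ops steps recorded at the top of this ticket (SNS topic, subscription, Auto Scaling group notifications) could probably be scripted along these lines. This is only a sketch using boto3-style clients, not what was actually deployed: the function name, topic name, group name, protocol and endpoint are all placeholders, and the ticket does not say how the Slack message is delivered from the SNS topic.

```python
# Sketch only: every name passed in below is a hypothetical placeholder.
# `sns` and `autoscaling` are expected to behave like boto3.client("sns")
# and boto3.client("autoscaling").

# Launch/terminate events are the ones that change the node count.
NOTIFICATION_TYPES = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
]


def configure_node_count_notifications(
    sns, autoscaling, asg_name, topic_name, protocol, endpoint
):
    """Create an SNS topic, subscribe an endpoint to it, and tell the
    Auto Scaling group to publish launch/terminate events to that topic.

    Returns the ARN of the topic.
    """
    # create_topic is idempotent: it returns the existing ARN if the
    # topic already exists with the same name and attributes.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]

    # Assumption: the subscriber (e.g. whatever forwards to Slack) is
    # reachable via the given protocol/endpoint pair.
    sns.subscribe(TopicArn=topic_arn, Protocol=protocol, Endpoint=endpoint)

    # One notification is published per scaling event, so there is no
    # "already firing" state that swallows later scale-ups.
    autoscaling.put_notification_configuration(
        AutoScalingGroupName=asg_name,
        TopicARN=topic_arn,
        NotificationTypes=NOTIFICATION_TYPES,
    )
    return topic_arn
```

With real clients this would be called as `configure_node_count_notifications(boto3.client("sns"), boto3.client("autoscaling"), ...)`. Taking the clients as arguments keeps the sketch easy to exercise with stubs before pointing it at a real account.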
Which part of the user docs does this impact
Communicate changes
Questions / Assumptions
Definition of done
Reference
How to write good user stories