ministryofjustice / modernisation-platform

A place for the core work of the Modernisation Platform • This repository is defined and managed in Terraform
https://user-guide.modernisation-platform.service.justice.gov.uk
MIT License

Notification/alerting system in Modernisation Platform #6817

Closed • ewastempel closed 2 months ago

ewastempel commented 5 months ago

User Story

As a Modernisation Platform member, I want to be notified about the health of resources and about upcoming events that need action (e.g. expiring certificates), so that I can react and fix or prevent an issue.

Value / Purpose

A healthy application or system means no outages.

Additional Information

This came in as a request on the ask channel. The aim is to look at whether we can implement a new alerting system, or reuse our existing one, in a form that members can easily consume.

Currently the MP alerting workflow is CloudWatch → SNS → PagerDuty → Slack, and how to use it is documented here.

This ticket is to remove the need for PagerDuty to act as a middleman, and to integrate with a variety of resources (CloudWatch, EventBridge, SNS) rather than being limited to one (although it could start with one and build from there).

The user who requested this suggested EventBridge → SNS → e-mail → Slack, an approach described here, which could be considered.

Definition of Done

dms1981 commented 4 months ago

Is this potentially too broad? Is this ticket meant to cover the creation of a new alerting/notification module that we can use, or a one-off to cover alerting when certificates are reaching their expiration date that could later be extended to replace PagerDuty as a middleman?

Is this something that customers are presently empowered to do without us being involved?

dms1981 commented 4 months ago

As you noted in Slack, @ewastempel, maybe this is a better fit for enrolling with the Observability Platform and getting the information through there?

richgreen-moj commented 2 months ago

🤔 For this ticket I'm thinking of creating a generic module (perhaps called modernisation-platform-aws-health-events) that creates an EventBridge rule to monitor AWS Health events and post them to an SNS topic. The topic can then be subscribed to by email, or perhaps even by Slack with AWS Chatbot, as described here.

This would meet the user's needs, as certificate renewals are posted as health events, but it would also serve as a more generic tool for users to configure alerts for other important health events.

If possible, perhaps the module could be configurable to point at particular services rather than all of them. We'll see.
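The idea above could be sketched in Terraform roughly as follows. This is a minimal sketch, not the module itself: every resource and topic name is hypothetical, and the commented-out `detail.service` filter is an assumption about how per-service narrowing might look.

```hcl
# Hypothetical sketch: EventBridge rule for AWS Health events -> SNS topic.
# All names are illustrative and do not come from the real module.

resource "aws_sns_topic" "health_events" {
  name = "aws-health-events"
}

resource "aws_cloudwatch_event_rule" "health_events" {
  name        = "aws-health-events"
  description = "Match AWS Health events delivered to this account"

  event_pattern = jsonencode({
    source        = ["aws.health"]
    "detail-type" = ["AWS Health Event"]
    # Assumed way to limit the rule to particular services:
    # detail = { service = ["ACM"] }
  })
}

# Send every matched event to the SNS topic.
resource "aws_cloudwatch_event_target" "health_to_sns" {
  rule = aws_cloudwatch_event_rule.health_events.name
  arn  = aws_sns_topic.health_events.arn
}

# EventBridge needs explicit permission to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge_publish" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.health_events.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "health_events" {
  arn    = aws_sns_topic.health_events.arn
  policy = data.aws_iam_policy_document.allow_eventbridge_publish.json
}
```

Email or Slack delivery would then be a matter of subscribing to the topic, which keeps the rule itself independent of how members choose to consume the alerts.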

ewastempel commented 2 months ago

To answer @dms1981's question, this is to:

  1. Remove the need for PD to act as a middleman, as stated in the description and the DoD. In the absence of Observability Platform functionality, this can be implemented as a TF module.
  2. Fix the user's problem.

@richgreen-moj I am not fully aware of AWS health check capabilities and limitations, so your plan sounds fine in theory, but I would like you to implement it in the above order (1, then 2). This means it should integrate with a variety of AWS services (e.g. networking, IAM, Lambda, certificates). So rather than resolving the problem for the user first, make sure it resolves the problem for us: we want to replace our existing alerting, which uses PD, with this new solution.

ewastempel commented 2 months ago

Further to my previous comment, if item 1 is not achievable, item 2 shouldn't be implemented. However, item 1 is still worth doing even if item 2 cannot be achieved.

richgreen-moj commented 2 months ago

So the current process is...

  1. Create SNS topic
  2. Create Alarm (which notifies topic)
  3. Integrate to PagerDuty (fun and games)

If the aim is to bypass PD, then we could use AWS Chatbot. I could use this TF resource in a callable module where you provide a list of existing topics etc. that you want to be notified about via Slack.

I think there would be a manual element to setting up a Chatbot Slack client per account, but once that's done you can do the rest in code.
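As a rough illustration of the "rest in code" part, the sketch below wires a list of existing SNS topics to one Slack channel. It assumes the `aws_chatbot_slack_channel_configuration` resource from a recent hashicorp/aws provider, which may not be the resource linked above. The variable names, role name, and configuration name are all hypothetical, and the Slack workspace is assumed to have been authorised manually beforehand.

```hcl
# Hypothetical sketch: existing SNS topics -> AWS Chatbot -> Slack channel.
# Assumes the Slack workspace was already authorised by hand in this account.

variable "slack_team_id" {
  type        = string
  description = "Slack workspace ID (illustrative variable name)"
}

variable "slack_channel_id" {
  type        = string
  description = "ID of the Slack channel that receives notifications"
}

variable "sns_topic_arns" {
  type        = list(string)
  description = "Existing SNS topics to forward to Slack"
}

# Trust policy so that the Chatbot service can assume the role.
data "aws_iam_policy_document" "chatbot_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["chatbot.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "chatbot" {
  name               = "chatbot-slack-notifications"
  assume_role_policy = data.aws_iam_policy_document.chatbot_assume.json
}

resource "aws_chatbot_slack_channel_configuration" "alerts" {
  configuration_name = "platform-alerts"
  iam_role_arn       = aws_iam_role.chatbot.arn
  slack_team_id      = var.slack_team_id
  slack_channel_id   = var.slack_channel_id
  sns_topic_arns     = var.sns_topic_arns
}
```

Taking the topic ARNs as an input means existing CloudWatch alarms and their topics stay untouched; only the delivery path changes from PagerDuty to Chatbot.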

richgreen-moj commented 2 months ago

This PR https://github.com/ministryofjustice/modernisation-platform-terraform-aws-chatbot/pull/1 provides the detail on the new AWS Chatbot module, which I have tested in Sprinkler. I have also written some unit tests for the module.